Abstract

Background: Dimensionality reduction (DR) enables the construction of a lower dimensional space (embedding) 
from a higher dimensional feature space while preserving object-class discriminability. However, several popular DR
approaches suffer from sensitivity to choice of parameters and/or presence of noise in the data. In this paper, we 
present a novel DR technique known as consensus embedding that aims to overcome these problems by 
generating and combining multiple low-dimensional embeddings, hence exploiting the variance among them in a 
manner similar to ensemble classifier schemes such as Bagging. We demonstrate theoretical properties of 
consensus embedding which show that it will result in a single stable embedding solution that preserves 
information more accurately as compared to any individual embedding (generated via DR schemes such as 
Principal Component Analysis, Graph Embedding, or Locally Linear Embedding). Intelligent sub-sampling (via mean- 
shift) and code parallelization are utilized to provide for an efficient implementation of the scheme. 

Results: Applications of consensus embedding are shown in the context of classification and clustering as applied 
to: (1) image partitioning of white matter and gray matter on 10 different synthetic brain MRI images corrupted 
with 18 different combinations of noise and bias field inhomogeneity, (2) classification of 4 high-dimensional gene- 
expression datasets, (3) cancer detection (at a pixel-level) on 16 image slices obtained from 2 different high- 
resolution prostate MRI datasets. In over 200 different experiments concerning classification and segmentation of 
biomedical data, consensus embedding was found to consistently outperform both linear and non-linear DR 
methods within all applications considered. 

Conclusions: We have presented a novel framework termed consensus embedding which leverages ensemble 
classification theory within dimensionality reduction, allowing for application to a wide range of high-dimensional 
biomedical data classification and segmentation problems. Our generalizable framework allows for improved 
representation and classification in the context of both imaging and non-imaging data. The algorithm offers a 
promising solution to problems that currently plague DR methods, and may allow for extension to other areas of 
biomedical data analysis. 



Background 

The analysis and classification of high-dimensional
biomedical data has been significantly facilitated via the
use of dimensionality reduction techniques, which allow
classifier schemes to overcome issues such as the curse of
dimensionality. This is an issue where the number of
variables (features) is disproportionately large compared to
the number of training instances (objects) [1].
Dimensionality reduction (DR) involves the projection of
data originally represented in an N-dimensional (N-D) space
into a lower n-dimensional (n-D) space (known as an
embedding) such that n << N. DR techniques are broadly
categorized as linear or non-linear, based on the type of
projection method used.

Linear DR techniques make use of simple linear projections
and consequently linear cost functions. An example of a
linear DR scheme is Principal Component Analysis [2] (PCA)
which projects data objects onto the axes of maximum
variance. However, maximizing the variance within the data
best preserves class discrimination only when distinct
separable clusters are present within the data, as shown in
[3]. In contrast, non-linear DR involves a non-linear
mapping of the data into a reduced dimensional space.
Typically these methods attempt to project data so that
relative local adjacencies between high dimensional data
objects, rather than some global measure such as variance,
are best preserved during data reduction from N- to n-D
space [4]. This tends to better retain class-discriminatory
information and may also account for any non-linear
structures that exist in the data (such as manifolds), as
illustrated in [5]. Examples of these techniques include
locally linear embedding [5] (LLE), graph embedding [6]
(GE), and isometric mapping [7] (ISOMAP). Recent work has
shown that in several scenarios, classification accuracy may
be improved via the use of non-linear DR schemes (rather
than linear DR) for gene-expression data [4,8] as well as
medical imagery [9,10].

However, typical DR techniques such as PCA, GE, or 
LLE may not guarantee an optimum result due to one 
or both of the following reasons: 

♦ Noise in the original N-D space tends to adversely 
affect class discrimination, even if robust features are 
used (as shown in [11]). A single DR projection may 
also fail to account for such artifacts (demonstrated 
in [12,13]). 

♦ Sensitivity to choice of parameters being specified 
during projection; e.g. in [14] it was shown that 
varying the neighborhood parameter in ISOMAP 
can lead to significantly different embeddings. 

In this paper, we present a novel DR scheme known as 
consensus embedding which aims to overcome the pro- 
blems of sensitivity to noise and choice of parameters that 
plague several popular DR schemes [12-14]. The spirit 
behind consensus embedding is to construct a single 
stable embedding by generating and combining multiple 
uncorrelated, independent embeddings; the hypothesis 
being that this single stable embedding will better preserve 
specific types of information in the data (such as class- 
based separation) as compared to any of the individual 
embeddings. Consensus embedding may be used in con- 
junction with either linear or non-linear DR methods and, 
as we will show, is intended to be easily generalizable to a 
large number of applications and problem domains. In 
this work, we will demonstrate the superiority of the con- 
sensus embedding representation for a variety of classifica- 
tion and clustering applications. 

Figure 1 illustrates an application of consensus embedding
in separating foreground (green) and background (red)
regions via pixel-level classification. Figure 1(a) shows a
simple RGB image to which Gaussian noise was added to the G
and B color channels (see Figure 1(b)). We now consider each
of the 3 color channels as features (i.e. N = 3) for all of
the image objects (pixels). Classification via replicated
k-means clustering [15] of all the objects (without
considering class information) was first performed using the
noisy RGB feature information (Figure 1(b)), in order to
distinguish the foreground from background.

The labels so obtained for each object (pixel) are then 
visualized in the image shown in Figure 1(c), where the 
color of the pixel corresponds to its cluster label. The 2 
colors in Figure 1(c) hence correspond to the 2 classes 
(clusters) obtained. No discernible regions are observable 
in this figure. Application of DR (via GE) reduces the data 
to an n = 2-D space, where the graph embedding algorithm
[6] non-linearly projects the data such that the object 
classes are maximally discriminable in the reduced dimen- 
sional space. However, as seen in Figure 1(d), clustering 
this reduced embedding space does not yield any 
obviously discernible image partitions either. 

By plotting all the objects onto 2D plots using only the 
R-G (Figure 1(e)) and R-B (Figure 1(f)) color channels 
respectively, we can see that separation between the two 
classes exists only along the R axis. In contrast, the 2D G- 
B plot (Figure 1(g)) shows no apparent separation between 
the classes. Combining 1D embeddings obtained via applying
graph embedding to Figures 1(e) and 1(f), followed by
unsupervised clustering, yields the consensus embedding 
result shown in Figure 1(h). Consensus embedding clearly 
results in superior background/foreground partitioning 
compared to the results shown in Figures 1(c), (d). 

Related Work and Significance 
Classifier and clustering ensembles 

Researchers have attempted to address problems of classi- 
fier sensitivity to noise and choice of parameters via the 
development of classifier ensemble schemes, such as 
Boosting [16] and Bagging [17]. These classifier ensembles 
guarantee a lower error rate as compared to any of the 
individual members (known as "weak" classifiers), assum- 
ing that the individual weak classifiers are all uncorrelated 
[18]. Similarly a consensus-based algorithm has been pre- 
sented [15] to find a stable unsupervised clustering of data 
using unstable methods such as k-means [19]. Multiple
"uncorrelated" clusterings of the data were generated and 
used to construct a co-association matrix based on cluster 
membership of all the points in each clustering. Naturally 
occurring partitions in the data were then identified. This 
idea was further extended in [20] where a combination of 
clusterings based on simple linear transformations of 
high-dimensional data was considered. Note that ensemble
techniques thus (1) make use of uncorrelated, or relatively
independent, analyses (such as classifications or
projections) of the data, and (2) combine multiple analyses
(such as classifications or projections) to enable a more
stable result.

Figure 1 Region partitioning of toy image data. (a) Original RGB image to which Gaussian noise was added to create (b) noisy RGB image.
Image visualization of classes obtained by replicated k-means clustering [15] of all the pixels via (c) original noisy RGB space, and (d) graph
embedding [6] of noisy RGB data. 2D plots of (e) R-G, (f) R-B, and (g) G-B planes are also shown where colors of objects plotted correspond to
the region in (b) that they are derived from. The discriminatory 2D spaces ((e) and (f)) are combined via consensus embedding, and the
visualized classification result is shown in (h). Note the significantly better image partitioning into foreground and background of (h) compared
to (c) and (d).
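The ensemble argument above (a combination of uncorrelated weak classifiers having a lower error rate than its members) can be illustrated with a small calculation. The sketch below is not taken from the cited works; it assumes K independent binary classifiers, each correct with the same probability p, combined by simple majority vote. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, k: int) -> float:
    """Probability that a majority of k independent classifiers is correct.

    Assumes every classifier is correct with the same probability p and
    that k is odd, so that ties cannot occur.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd to avoid ties")
    needed = k // 2 + 1  # smallest number of correct votes forming a majority
    return sum(
        comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(needed, k + 1)
    )


if __name__ == "__main__":
    # Accuracy of the vote for a growing number of weak classifiers.
    for k in (1, 5, 11, 21):
        print(k, round(majority_vote_accuracy(0.7, k), 4))
```

Under these assumptions the accuracy of the vote grows with K whenever p > 0.5. When the members are correlated the gain shrinks, which is why the ensemble literature stresses uncorrelated analyses.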

Improved DR schemes to overcome parameter sensitivity 

As shown by [7], linear DR methods such as classical 
multi-dimensional scaling [21] are unable to account for 
non-linear proximities and structures when calculating 
an embedding that best preserves pairwise distances 
between data objects. This led to the development of 
non-linear DR methods such as LLE [5] and ISOMAP 
[7] which make use of local neighborhoods to better cal- 
culate such proximities. As previously mentioned, DR 
methods are known to suffer from certain shortcomings 
(sensitivity to noise and/or change in parameters). A 
number of techniques have recently been proposed to 
overcome these shortcomings. 

In [22,23] methods were proposed to choose the opti- 
mal neighborhood parameter for ISOMAP and LLE 
respectively. This was done by first constructing multi- 
ple embeddings based on an intelligently selected subset 
of parameter values, and then choosing the embedding 
with the minimum residual variance. Attempts have 
been made to overcome problems due to noisy data by 
selecting data objects known to be most representative 
of their local neighborhood (landmarks) in ISOMAP 
[24], or estimating neighborhoods in LLE via selection 
of data objects that are unlikely to be outliers (noise)
[13]. Similarly, graph embedding has also been explored
with respect to issues such as the scale of analysis and 
determining accurate groups in the data [25]. However, 
all of these methods require an exhaustive search of the 
parameter space in order to best solve the specific pro- 
blem being addressed. Alternatively, one may utilize 
class information within the supervised variants [26,27] 
of ISOMAP and LLE which attempt to construct 
weighted neighborhood graphs that explicitly preserve 
class information while embedding the data. 
Learning in the context of dimensionality reduction 
The application of classification theory to DR has begun to 
be explored recently. Athitsos et al. presented a nearest
neighbor retrieval method known as BoostMap [28], in 
which distances from different reference objects are com- 
bined via boosting. The problem of selecting and weight- 
ing the most relevant distances to reference objects was 
posed in terms of classification in order to utilize the Ada- 
boost algorithm [16], and BoostMap was shown to 
improve the accuracy and speed of overall nearest neigh- 
bor discovery compared to traditional methods. DR has 
also previously been formulated in terms of maximizing 
the entropy [29] or via a simultaneous dimensionality 
reduction and regression methodology involving Bayesian 
mixture modeling [30]. The goal in such methods is to 
probabilistically estimate the relationships between points
based on objective functions that are dependent on the
data labels [29]. These methods have been demonstrated 
in the context of application of PCA to non-linear datasets 
[30]. More recently, multi-view learning algorithms [31] 
have attempted to address the problem of improving the 
learning ability of a system by considering several disjoint 
subsets of features (views) of the data. The work most clo- 
sely related to our own is that of [32] in the context of 
web data mining via multi-view learning. Given that a hid- 
den pattern exists in a dataset, different views of this data 
are each embedded and transformed such that known 
domain information (encoded via pairwise link con- 
straints) is preserved within a common frame of reference. 
The authors then solve for a consensus pattern which is 
considered the best approximation of the underlying hid- 
den pattern being solved for. A similar idea was examined 
in [33,34] where 1D projections of image data were
co-registered in order to better perform operations such as
image-based breathing gating as well as multi-modal regis- 
tration. Unlike consensus embedding, these algorithms 
involve explicit transformations of embedding data to a 
target frame of reference, as well as being semi-supervised 
in encoding specific link constraints in the data. 
Intuition and significance of consensus embedding 
In this paper we present a novel DR scheme (consensus
embedding) that involves first generating and then combining
multiple uncorrelated, independent (or base) n-D embeddings.
These base embeddings may be obtained via either linear or
non-linear DR techniques being applied to a large N-D
feature space. Note that we use the terms "uncorrelated,
independent" with reference to the method of constructing
base embeddings, similar to their usage in ensemble
classification literature [18]. Indeed, techniques to
generate multiple base embeddings may be seen to be
analogous to those for constructing classifier ensembles. In
the latter, base classifiers with significant variance can
be generated by varying the parameter associated with the
classification method (k in kNN classifiers [35]) or by
varying the training data (combining decision trees via
Bagging [17]). Previously, a consensus method for LLE was
examined in [36] with the underlying hypothesis that varying
the neighborhood parameter (n) will effectively generate
multiple uncorrelated, independent embeddings for the
purposes of constructing a consensus embedding. The
combination of such base embeddings for magnetic resonance
spectroscopy data was found to result in a low-dimensional
data representation which enabled improved discrimination of
cancerous and benign spectra compared to using any single
application of LLE. In this work we shall consider an
approach inspired by random forests [37] (which in turn is a
modification of the Bagging algorithm [17]), where
variations within the feature data are used to generate
multiple embeddings which are then combined via our
consensus embedding scheme. Additionally, unlike most
current DR approaches which require tuning of associated
parameters for optimal performance in different datasets,
consensus embedding offers a methodology that is not
significantly sensitive to parameter choice or dataset type.

The major contributions of our work are: 

■ A novel DR approach which generates and com- 
bines embeddings. 

■ A largely parameter invariant scheme for dimen- 
sionality reduction. 

■ A DR scheme easily applicable to a wide variety of 
pattern recognition problems including image parti- 
tioning, data mining, and high dimensional data 
classification. 

The organization of the rest of this paper is as follows. 
In Section 2 we will examine the theoretical grounding 
and properties of consensus embedding, followed by algo- 
rithms to efficiently implement the consensus embedding 
scheme. In Section 3 we show the application of consensus 
embedding in the context of (1) partitioning of synthetic 
as well as clinical images, and (2) classification of gene- 
expression studies. Quantitative and qualitative results of 
this evaluation, as well as discussion of the results and 
concluding remarks, are presented in Section 4. 

Methods 

Theory of Consensus Embedding 

The spirit of consensus embedding lies in the generation 
and combination of multiple embeddings in order to 
construct a more stable, stronger result. Thus we will 
first define various terms associated with embedding con- 
struction. Based on these, we can mathematically forma- 
lize the concept of generating and combining multiple 
base embeddings, which will in turn allow us to derive 
necessary and sufficient conditions that must be satisfied 
when constructing a consensus embedding. Based on 
these conditions we will describe the specific algorithmic 
steps in more detail. Notation that is used in this section 
is summarized in Table 1. 
Preliminaries 

An object shall be referred to by its label c and is defined
as a point in an N-dimensional space $\mathbb{R}^N$. It is
represented by an N-tuple F(c) comprising its unique
N-dimensional co-ordinates. In a sub-space
$\mathbb{R}^n \subset \mathbb{R}^N$ such that n << N, this
object c in a set C is represented by an n-tuple of its
unique n-dimensional coordinates X(c). $\mathbb{R}^n$ is
also known as the embedding of objects $c \in C$ and is
always calculated via some projection of $\mathbb{R}^N$. For
example in the case of $\mathbb{R}^3$, we can define
$F(c) = \{f_1, f_2, f_3\}$ based on the co-ordinate
locations $(f_1, f_2, f_3)$ on each of the 3 axes for object
$c \in C$. The corresponding embedding vector of $c \in C$
in $\mathbb{R}^2$ will be $X(c) = \{e_1, e_2\}$ with
co-ordinate axes locations $(e_1, e_2)$. Note that in
general, determining the target dimensionality (n) for any
$\mathbb{R}^N$ may be done by a number of algorithms such as
the one used in this work [38].

Table 1 Notation and symbols

Symbol                        Description
$\mathbb{R}^N$                High (N)-dimensional space
$\mathbb{R}^n$                Low (n)-dimensional space
$c, d, e$                     Objects in set C
$Z$                           Number of unique triplets in C
$F(c)$                        High-dimensional feature vector
$X(c)$                        Embedding vector
$\Lambda^{cd}$                Pairwise relationship in $\mathbb{R}^N$
$\delta^{cd}$                 Pairwise relationship in $\mathbb{R}^n$
$\Delta(c, d, e)$             Triangle relationship (Defn. 1)
$\psi^{ES}(\mathbb{R}^n)$     Embedding strength (Defn. 2)
$\hat{\mathbb{R}}^n$          True embedding (Defn. 3)
$\hat{\delta}^{cd}$           Pairwise relationship in $\hat{\mathbb{R}}^n$
$\ddot{\mathbb{R}}^n$         Strong embedding (Defn. 4)
$\dot{\mathbb{R}}^n$          Weak embedding
$\tilde{\mathbb{R}}^n$        Consensus embedding (Defn. 5)
$\tilde{\delta}^{cd}$         Pairwise relationship in $\tilde{\mathbb{R}}^n$
$M$                           Number of generated embeddings
$K$                           Number of selected embeddings
$R$                           Number of objects in C
$\tilde{X}(c)$                Consensus embedding vector

Summary of notation and symbols used in this paper.

The notation $\Lambda^{cd}$, henceforth referred to as the
pairwise relationship, will represent the relationship
between two objects $c, d \in C$ with corresponding vectors
$F(c), F(d) \in \mathbb{R}^N$. Similarly, the notation
$\delta^{cd}$ will be used to represent the pairwise
relationship between two objects $c, d \in C$ with embedding
vectors $X(c), X(d) \in \mathbb{R}^n$. We assume that this
relationship satisfies the three properties of a metric
(e.g. Euclidean distance). Finally, a triplet of objects
$c, d, e \in C$ is referred to as a unique triplet if
$c \neq d$, $d \neq e$, and $c \neq e$. Unique triplets will
be denoted simply as (c, d, e).
Definitions 

Definition 1 The function $\Delta$ defined on a unique
triplet (c, d, e) is called a triangle relationship,
$\Delta(c, d, e)$, if when $\Lambda^{cd} < \Lambda^{ce}$ and
$\Lambda^{cd} < \Lambda^{de}$, then
$\delta^{cd} < \delta^{ce}$ and $\delta^{cd} < \delta^{de}$.

For objects $c, d, e \in C$ whose relative pairwise
relationships in $\mathbb{R}^N$ are preserved in
$\mathbb{R}^n$, the triangle relationship
$\Delta(c, d, e) = 1$. For ease of notation, the triangle
relationship $\Delta(c, d, e)$ will be referred to as
$\Delta$ where appropriate. Note that for a set of R unique
objects (R = |C|, where |.| is the cardinality of a set),
$Z = \frac{R!}{3!(R-3)!}$ unique triplets may be formed.

Definition 2 Given Z unique triplets $(c, d, e) \in C$ and
an embedding $\mathbb{R}^n$ of all objects $c, d, e \in C$,
the associated embedding strength
$\psi^{ES}(\mathbb{R}^n) = \frac{\sum_{C} \Delta(c, d, e)}{Z}$.

The embedding strength (ES) of an embedding $\mathbb{R}^n$,
denoted $\psi^{ES}(\mathbb{R}^n)$, is hence the fraction of
unique triplets $(c, d, e) \in C$ for which
$\Delta(c, d, e) = 1$.

Definition 3 A true embedding, $\hat{\mathbb{R}}^n$, is an
embedding for which $\psi^{ES}(\hat{\mathbb{R}}^n) = 1$.

A true embedding $\hat{\mathbb{R}}^n$ is one for which the
triangle relationship is satisfied for all unique triplets
$(c, d, e) \in C$, hence perfectly preserving all pairwise
relationships from $\mathbb{R}^N$ to $\hat{\mathbb{R}}^n$.
Additionally, for all objects $c, d \in C$ in
$\hat{\mathbb{R}}^n$, the pairwise relationship is denoted
as $\hat{\delta}^{cd}$.

Note that according to Definition 3, the most optimal true
embedding may be considered to be the original
$\mathbb{R}^N$ itself, i.e. $\hat{\delta}^{cd} = \Lambda^{cd}$.
However, as $\mathbb{R}^N$ may not be optimal for
classification (due to the curse of dimensionality), we are
attempting to approximate a true embedding as best possible
in n-D space. Note that multiple true embeddings in n-D
space may be calculated from a single $\mathbb{R}^N$; any
one of these may be chosen to calculate $\hat{\delta}^{cd}$.

Practically speaking, any $\mathbb{R}^n$ will be associated
with some degree of error compared to the original
$\mathbb{R}^N$. This is almost a given since some loss of
information and concomitant error can be expected to occur
in going from a high- to a low-dimensional space. We can
calculate the probability of pairwise relationships being
accurately preserved from $\mathbb{R}^N$ to $\mathbb{R}^n$,
i.e. the probability that $\Delta(c, d, e) = 1$ for any
unique triplet $(c, d, e) \in C$ in any $\mathbb{R}^n$, as

$$p(\Delta \mid c, d, e, \mathbb{R}^n) = \frac{\sum_{C} \Delta(c, d, e)}{Z}. \qquad (1)$$

More details on this formulation may be found in the
Appendix. Note that the probability in Equation 1 is
binomial, as the complementary probability to
$p(\Delta \mid c, d, e, \mathbb{R}^n)$ (i.e. the probability
that $\Delta(c, d, e) \neq 1$ for any unique triplet
$(c, d, e) \in C$ in any $\mathbb{R}^n$) is given by
$1 - p(\Delta \mid c, d, e, \mathbb{R}^n)$ (in the case of
binomial probabilities, event outcomes can be broken down
into two probabilities which are complementary, i.e. they
sum to 1).
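Definitions 1 and 2 can be sketched in a few lines of Python. This is an illustrative reading rather than the authors' implementation: it assumes Euclidean distance for both pairwise relationships, and it counts a triplet as preserved when the closest pair of the triplet in the high-dimensional space is also the closest pair in the embedding. The function names are hypothetical.

```python
from itertools import combinations
from math import comb, dist


def triangle_relationship(F, X, c, d, e):
    """Return 1 if the closest pair of (c, d, e) in F is also closest in X."""
    pairs = [(c, d), (c, e), (d, e)]
    high = [dist(F[a], F[b]) for a, b in pairs]  # relationships in R^N
    low = [dist(X[a], X[b]) for a, b in pairs]   # relationships in R^n
    return int(high.index(min(high)) == low.index(min(low)))


def embedding_strength(F, X):
    """Fraction of unique triplets whose triangle relationship is preserved."""
    r = len(F)
    z = comb(r, 3)  # Z = R! / (3! (R - 3)!) unique triplets
    preserved = sum(
        triangle_relationship(F, X, c, d, e)
        for c, d, e in combinations(range(r), 3)
    )
    return preserved / z


if __name__ == "__main__":
    # Four objects in 3-D and a 1-D embedding that keeps only the first axis.
    F = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (5.0, 0.0, 0.2), (9.5, 0.3, 0.1)]
    X = [(f[0],) for f in F]
    print(embedding_strength(F, X))
```

Enumerating all triplets costs on the order of R cubed operations, so a sketch like this is only practical for small object sets.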

Definition 4 A strong embedding, $\ddot{\mathbb{R}}^n$, is
an embedding for which
$\psi^{ES}(\ddot{\mathbb{R}}^n) > \theta$.

In other words, a strong embedding is defined as one which
accurately preserves the triangle relationship for more than
some significant fraction ($\theta$) of the unique triplets
of objects $c, d, e \in C$ that exist. An embedding
$\mathbb{R}^n$ which is not a strong embedding is referred
to as a weak embedding, denoted as $\dot{\mathbb{R}}^n$.

We can calculate multiple uncorrelated (i.e. independent)
embeddings from a single $\mathbb{R}^N$ which may be denoted
as $\mathbb{R}^n_m$, $m \in \{1, \ldots, M\}$, where M is
the total number of possible uncorrelated embeddings. Note
that both strong and weak embeddings will be present among
all of the M possible embeddings. All objects $c, d \in C$
can then be characterized by corresponding embedding vectors
$X_m(c), X_m(d) \in \mathbb{R}^n_m$ with corresponding
pairwise relationship $\delta^{cd}_m$. Given multiple
$\delta^{cd}_m$, we can form a distribution
$p(X = \delta^{cd}_m)$ over all M embeddings. Our hypothesis
is that the maximum likelihood estimate (MLE) of
$p(X = \delta^{cd}_m)$, denoted as $\tilde{\delta}^{cd}$,
will approximate the true pairwise relationship
$\hat{\delta}^{cd}$ for objects $c, d \in C$.

Definition 5 An embedding $\mathbb{R}^n$ is called a
consensus embedding, $\tilde{\mathbb{R}}^n$, if for all
objects $c, d \in C$, $\delta^{cd} = \tilde{\delta}^{cd}$.

We denote the consensus embedding vectors for all objects
$c \in C$ by $\tilde{X}(c) \in \tilde{\mathbb{R}}^n$.
Additionally, from Equation 1,
$p(\Delta \mid c, d, e, \tilde{\mathbb{R}}^n)$ represents
the probability that $\Delta(c, d, e) = 1$ for any
$(c, d, e) \in C$ in $\tilde{\mathbb{R}}^n$.
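The idea of estimating one pairwise relationship per object pair from several embeddings can be illustrated as follows. This sketch rests on assumptions that are not stated in the text above: it takes Euclidean distance as the pairwise relationship and uses the median over embeddings as the point estimate (the maximum likelihood location estimate under a Laplace model; the mean would play that role under a Gaussian model). `consensus_pairwise` is a hypothetical name.

```python
from itertools import combinations
from math import dist
from statistics import median


def consensus_pairwise(embeddings, estimator=median):
    """Combine pairwise distances across embeddings into one value per pair.

    embeddings: list of embeddings, each a list of coordinate tuples with
    the objects in the same order. Returns {(c, d): estimate} for c < d.
    """
    n_objects = len(embeddings[0])
    if any(len(emb) != n_objects for emb in embeddings):
        raise ValueError("all embeddings must contain the same objects")
    consensus = {}
    for c, d in combinations(range(n_objects), 2):
        # Relationship between c and d in each individual embedding.
        values = [dist(emb[c], emb[d]) for emb in embeddings]
        consensus[(c, d)] = estimator(values)
    return consensus


if __name__ == "__main__":
    # Three 1-D embeddings of three objects; the third misplaces object 1.
    embeddings = [
        [(0.0,), (1.0,), (3.0,)],
        [(0.0,), (1.2,), (3.0,)],
        [(0.0,), (10.0,), (3.0,)],
    ]
    print(consensus_pairwise(embeddings))
```

A robust estimator such as the median limits the influence of a single poor embedding on the combined relationship, which matches the motivation for excluding weak embeddings.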
Necessary and sufficient conditions for consensus embedding

While $\tilde{\mathbb{R}}^n$ is expected to approximate
$\hat{\mathbb{R}}^n$ as best possible, it cannot be
guaranteed that $\psi^{ES}(\tilde{\mathbb{R}}^n) = 1$ as
this is dependent on how well $\tilde{\delta}^{cd}$
approximates $\hat{\delta}^{cd}$, for all objects
$c, d \in C$. $\tilde{\delta}^{cd}$ may be calculated
inaccurately as a result of considering pairwise
relationships derived from weak embeddings present amongst
the M embeddings that are generated. As Proposition 1 and
Lemma 1 below demonstrate, in order to ensure that
$\psi^{ES}(\tilde{\mathbb{R}}^n) \to 1$,
$\tilde{\mathbb{R}}^n$ must be constructed from a
combination of multiple strong embeddings
$\ddot{\mathbb{R}}^n$ alone, so as to avoid including weak
embeddings.

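Proposition 1 If K < M independent, strong embeddings
$\ddot{\mathbb{R}}^n_k$, $k \in \{1, \ldots, K\}$, with a
constant probability
$p(\Delta \mid c, d, e, \ddot{\mathbb{R}}^n_k)$ that
$\Delta(c, d, e) = 1$ for all $(c, d, e) \in C$, are used to
calculate $\tilde{\mathbb{R}}^n$, then
$\psi^{ES}(\tilde{\mathbb{R}}^n) \to 1$ as $K \to \infty$.

Proof. If K < M independent, strong embeddings alone are
utilized in the construction of $\tilde{\mathbb{R}}^n$, then
the number of weak embeddings is (M - K). As Equation 1
represents a binomial probability,
$p(\Delta \mid c, d, e, \tilde{\mathbb{R}}^n)$ can be
approximated via the binomial formulation of Equation 1 as,

$$p(\Delta \mid c, d, e, \tilde{\mathbb{R}}^n) = \sum_{K} \binom{M}{K} \alpha^{K} (1 - \alpha)^{M-K}, \qquad (2)$$

where $\alpha = p(\Delta \mid c, d, e, \ddot{\mathbb{R}}^n_k)$
(Equation 1) is considered to be constant. Based on
Equation 2, as $K \to \infty$,
$p(\Delta \mid c, d, e, \tilde{\mathbb{R}}^n) \to 1$, which
in turn implies that $\psi^{ES}(\tilde{\mathbb{R}}^n) \to 1$;
therefore $\tilde{\mathbb{R}}^n$ approaches
$\hat{\mathbb{R}}^n$. $\square$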

Proposition 1 demonstrates that for a consensus embedding to
be strong, it is sufficient that strong embeddings be used
to construct it. Note that as $K \to M$, if
$p(\Delta \mid c, d, e, \ddot{\mathbb{R}}^n_k) > \theta$,
then $p(\Delta \mid c, d, e, \tilde{\mathbb{R}}^n) \gg \theta$.
Based on Equation 1 and Definitions 2, 4, this implies that
as $K \to M$, $\psi^{ES}(\tilde{\mathbb{R}}^n) \gg \theta$.
Lemma 1 below demonstrates the necessary nature of this
condition, i.e. if weak embeddings are considered when
constructing $\tilde{\mathbb{R}}^n$, then
$\psi^{ES}(\tilde{\mathbb{R}}^n) \ll \theta$ (it will be a
weak embedding).

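Lemma 1 If K < M independent, weak embeddings
$\dot{\mathbb{R}}^n_k$, $k \in \{1, \ldots, K\}$, with
$\psi^{ES}(\dot{\mathbb{R}}^n_k) < \theta$, are used to
calculate $\tilde{\mathbb{R}}^n$, then
$\psi^{ES}(\tilde{\mathbb{R}}^n) \ll \theta$.

Proof. From Equation 1 and Definitions 2, 4, if
$\psi^{ES}(\dot{\mathbb{R}}^n_k) < \theta$, then
$p(\Delta \mid c, d, e, \dot{\mathbb{R}}^n_k) < \theta$.
Substituting $p(\Delta \mid c, d, e, \dot{\mathbb{R}}^n_k)$
in Equation 2 will result in
$p(\Delta \mid c, d, e, \tilde{\mathbb{R}}^n) \ll \theta$.
Thus $\psi^{ES}(\tilde{\mathbb{R}}^n) \ll \theta$, and
$\tilde{\mathbb{R}}^n$ will be weak. $\square$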

Proposition 1 and Lemma 1 together demonstrate the necessary
and sufficient nature of the conditions required to
construct a consensus embedding: that if a total of M base
embeddings are calculated from a single $\mathbb{R}^N$, some
minimum number of strong embeddings (K < M) must be
considered to construct a $\tilde{\mathbb{R}}^n$ that is a
strong embedding. Further, a $\tilde{\mathbb{R}}^n$ so
constructed will have an embedding strength
$\psi^{ES}(\tilde{\mathbb{R}}^n)$ that will increase
significantly as we include more strong embeddings in its
computation. Appendix B demonstrates an additional property
of $\tilde{\mathbb{R}}^n$ showing that it preserves
information from $\mathbb{R}^N$ with less inherent error
than any $\mathbb{R}^n$ used in its construction.

Algorithms and Implementation 

Based on Proposition 1, 3 distinct steps are typically 
required for calculating a consensus embedding. First, we 
must generate a number of base embeddings (M), the 
steps for which are described in CreateEmbed. We then 
select for strong embeddings from amongst M base 
embeddings generated, described in SelEmbed. We will 
also discuss criteria for selecting strong embeddings. 
Finally, selected embeddings are combined to result in the 
final consensus embedding representation as explained in 
CalcConsEmbed. We also discuss some of the computa- 
tional considerations of our implementation. 
Creating n-dimensional data embeddings 
One of the requirements for consensus embedding is the
calculation of multiple uncorrelated, independent embeddings
$\mathbb{R}^n$ from a single $\mathbb{R}^N$. This is also
true of ensemble classification systems such as Boosting
[16] and Bagging [17] which require multiple uncorrelated,
independent classifications of the data to be generated
prior to combination. As discussed previously, the terms
"uncorrelated, independent" are used by us with reference to
the method of constructing embeddings, as borrowed from
ensemble classification literature [18]. Similar to random
forests [37], we make use of a feature space perturbation
technique to generate uncorrelated (base) embeddings. This
is implemented by first creating M bootstrapped feature
subsets of V features each (every subset $\eta_m$,
$m \in \{1, \ldots, M\}$, containing V of the N features, no
DR involved). Note that the number of samples in each
V-dimensional subset is the same as in the original
N-dimensional space. Each V-dimensional $\eta_m$ is then
embedded in n-D space via DR (i.e. projecting from
$\mathbb{R}^V$ to $\mathbb{R}^n$). M is chosen such that
each of the N dimensions appears in at least one $\eta_m$.

Algorithm CreateEmbed

Input: $F(c) \in \mathbb{R}^N$ for all objects $c \in C$, n

Output: $X_m(c) \in \mathbb{R}^n_m$, $m \in \{1, \ldots, M\}$

Data Structures: Feature subsets $\eta_m$, total number of
subsets M, number of features in each subset V, DR
method $\Phi$

begin

0. for m = 1 to M do

1.    Select V < N features from $\mathbb{R}^N$, forming subset $\eta_m$;

2.    Calculate $X_m(c) \in \mathbb{R}^n_m$, for all $c \in C$ using $\eta_m$
      and method $\Phi$;

3. endfor
end
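A minimal Python sketch of CreateEmbed follows, under stated assumptions: feature subsets are drawn uniformly at random (without repeats inside a subset), extra subsets are drawn beyond M if needed until every one of the N features has been used at least once, and the DR method is passed in as a callable rather than tied to PCA, GE, or LLE. The names `create_feature_subsets`, `create_embeddings`, and `keep_first_axes` are hypothetical, and `keep_first_axes` is only a toy stand-in for a real DR method.

```python
import random


def create_feature_subsets(n_features, subset_size, n_subsets, seed=0):
    """Draw feature subsets of size V until all N features are covered."""
    if not 0 < subset_size < n_features:
        raise ValueError("subset_size must satisfy 0 < V < N")
    rng = random.Random(seed)
    subsets, covered = [], set()
    # Keep drawing until we have at least M subsets and full coverage.
    while len(subsets) < n_subsets or len(covered) < n_features:
        eta = sorted(rng.sample(range(n_features), subset_size))
        subsets.append(eta)
        covered.update(eta)
    return subsets


def create_embeddings(F, subsets, dr_method, n):
    """Embed every feature subset into n dimensions with the given method."""
    embeddings = []
    for eta in subsets:
        # Restrict each object to the V selected features; the number of
        # objects is unchanged.
        reduced = [[row[j] for j in eta] for row in F]
        embeddings.append(dr_method(reduced, n))
    return embeddings


def keep_first_axes(data, n):
    """Toy stand-in for a DR method: keep the first n coordinates."""
    return [tuple(row[:n]) for row in data]


if __name__ == "__main__":
    F = [[float(i + j) for j in range(6)] for i in range(4)]
    subsets = create_feature_subsets(n_features=6, subset_size=3, n_subsets=4)
    base = create_embeddings(F, subsets, keep_first_axes, n=2)
    print(len(subsets), len(base))
```

Passing the DR method as a parameter keeps the subset generation independent of the projection technique, in line with the statement that consensus embedding may be used with either linear or non-linear DR.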

As discussed in the introduction, multiple methods exist 
to generate base embeddings, such as varying a parameter 
associated with a method (e.g. neighborhood parameter in 
LLE, as shown in [36]) as well as the method explored in 
this paper (feature space perturbation). These methods are 
analogous to methods in the literature for generating base 
classifiers in a classifier ensemble [18], such as varying k in
kNN classifiers (changing associated parameter) [39], or
varying the training set for decision trees (perturbing the 
feature space) [37]. 
Selection of strong embeddings 

Having generated M base embeddings, we first calculate their
embedding strengths $\psi^{ES}(\mathbb{R}^n_m)$ for all
$\mathbb{R}^n_m$, $m \in \{1, \ldots, M\}$. The calculation
of $\psi^{ES}$ can be done via performance evaluation
measures such as those described below, based on the
application and prior domain knowledge. Embeddings for which
$\psi^{ES}(\mathbb{R}^n_m) > \theta$ are then selected as
strong embeddings, where $\theta$ is a pre-specified
threshold.
Algorithm SelEmbed

Input: $X_m(c) \in \mathbb{R}^n_m$ for all objects $c \in C$,
$m \in \{1, \ldots, M\}$

Output: $X_k(c) \in \ddot{\mathbb{R}}^n_k$, $k \in \{1, \ldots, K\}$

Data Structures: A list Q, embedding strength function
$\psi^{ES}$, embedding strength threshold $\theta$

begin

0. for m = 1 to M do

1.    Calculate $\psi^{ES}(\mathbb{R}^n_m)$;

2.    if $\psi^{ES}(\mathbb{R}^n_m) > \theta$

3.       Put m in Q;

4.    endif

5. endfor

6. For each element k of Q, store $X_k(c) \in \ddot{\mathbb{R}}^n_k$ for all
   objects $c \in C$;

end
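SelEmbed amounts to a threshold filter over the base embeddings. The sketch below assumes the embedding strength is supplied as a callable (`strength_fn`, a hypothetical parameter) so that either evaluation measure can be plugged in; it illustrates the selection step and is not the authors' code.

```python
def select_strong_embeddings(embeddings, strength_fn, theta):
    """Return the indices and embeddings whose strength exceeds theta."""
    selected = []  # plays the role of the list Q in SelEmbed
    for m, embedding in enumerate(embeddings):
        if strength_fn(embedding) > theta:
            selected.append(m)
    return selected, [embeddings[k] for k in selected]


if __name__ == "__main__":
    # Placeholder embeddings with made-up strengths, for illustration only.
    strengths = {"emb_a": 0.9, "emb_b": 0.4, "emb_c": 0.75}
    indices, strong = select_strong_embeddings(
        list(strengths), strengths.get, theta=0.7
    )
    print(indices, strong)
```

The comparison is strict, matching the condition that the strength be greater than the threshold.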

Note that while $\theta$ may be considered to be a parameter
which needs to be specified to construct the consensus
embedding, we have found in our experiments that the results
are relatively robust to variations in $\theta$. In general,
$\theta$ may be defined based on the manner of evaluating
the embedding strength, as discussed in the next section.

Evaluation of embedding strength 

We present two performance measures in order to evalu- 
ate embedding strength: one measure being supervised 
and relying on label information; the other being unsuper- 
vised and driven by the separability of distinct clusters in 
the reduced dimensional embedding space. In Experiment 
4 we compare the two performance measures against each 
other to determine their relative effectiveness in construct- 
ing a strong consensus embedding. 

Supervised evaluation of embedding strength We have
demonstrated that embedding strength increases as a function
of classification accuracy (Theorem 1, Appendix), implying
that strong embeddings will have high classification
accuracies. Intuitively, this can be explained as strong
embeddings showing greater class separation compared to weak
embeddings. Given a binary labeled set of samples C, we
denote the sets of objects corresponding to the two classes
as $S^+$ and $S^-$, such that $C = S^+ \cup S^-$ and
$S^+ \cap S^- = \emptyset$. When using a classification
algorithm that does not consider class labels, we can
evaluate classification accuracy as follows:

1. Apply classification algorithm to C (embedded in 
R") to find T clusters (unordered, labeled set of 
objects), denoted via \J/ tJ £ g { 1, . . . , T}- 

2. For each ty t 

(a) Calculate DTP = |* t n S + |- 

(b) Calculate DTN = |(C - * t ) n S~[ 

(c) Calculate classification accuracy for \|» t , as 
(h^H^A = DTP+DTN 

V (^tJ \s*Us-\ ■ 

3. Calculate classification accuracy of R* as 
0*c(R») = max T [V cc (* r )]. 

As classification has been done without considering 
label information, we must evaluate which of the clus- 
ters so obtained shows the greatest overlap with S + (the 
class of interest). We therefore consider the classifica- 
tion accuracy of the cluster showing the most overlap 
with S + as an approximation of the embedding strength 
of R M , i.e. (/ S (R") « ^ CC (R"). 
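
The three steps above can be sketched as follows. The function name `clustering_accuracy` and the toy clusters are hypothetical; clusters and classes are represented simply as Python sets of object indices, which is an assumption made for illustration.

```python
def clustering_accuracy(clusters, positives, all_objects):
    """phi^Acc: best (DTP + DTN) / |S+ u S-| over the T clusters."""
    positives = set(positives)
    all_objects = set(all_objects)
    negatives = all_objects - positives
    best = 0.0
    for cluster in clusters:
        cluster = set(cluster)
        dtp = len(cluster & positives)                  # step 2(a)
        dtn = len((all_objects - cluster) & negatives)  # step 2(b)
        best = max(best, (dtp + dtn) / len(all_objects))  # steps 2(c), 3
    return best


# Six made-up objects; objects 0-2 form the class of interest S+.
objects = range(6)
s_plus = {0, 1, 2}
clusters = [{0, 1, 3}, {2, 4, 5}]
acc = clustering_accuracy(clusters, s_plus, objects)
print(acc)
```

Here the first cluster overlaps most with S+ (two true positives, two true negatives out of six objects), so its accuracy is the one reported for the embedding.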
Unsupervised evaluation of embedding strength. We utilize a measure known
as the R-squared index (RSI), based on cluster validity measures [40],
which can be calculated as follows:

1. Apply classification algorithm to C (embedded in $\mathbb{R}^n$) to
   find T clusters (unordered, labeled set of objects), denoted via
   $\Psi_t$, $t \in \{1, \ldots, T\}$.

2. Calculate
   $SST = \sum_{j=1}^{n} \left[ \sum_{i=1}^{|C|} (X(c_i) - \bar{X}(c_j))^2 \right]$
   (where $\bar{X}(c_j)$ is the mean of data values in the $j$-th
   dimension).

3. Calculate
   $SSB = \sum_{j=1 \ldots n,\; t=1 \ldots T} \left[ \sum_{i=1}^{|\Psi_t|} (X(c_i) - \bar{X}(c_j))^2 \right]$.

4. Calculate R-squared index of $\mathbb{R}^n$ as
   $\phi^{RS}(\mathbb{R}^n) = \frac{SST - SSB}{SST}$.

RSI may be considered both a measure of the degree of difference between
clusters found in a dataset as well as a measurement of the degree of
homogeneity between them. The value of $\phi^{RS}$ ranges between 0 and 1,
where if $\phi^{RS} = 0$, no difference exists among clusters. Conversely,
a value close to $\phi^{RS} = 1$ suggests well-defined, separable clusters
in the embedding space. Note that when using RSI to evaluate embedding
strength, it will be difficult to ensure that all selected embeddings are
strong without utilizing a priori information. In such a case we can
attempt to ensure that a significant majority of the embeddings selected
are strong, which will also ensure that the consensus embedding
$\tilde{\mathbb{R}}^n$ is strong (based on Proposition 1).
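
A minimal sketch of the RSI computation is given here. It assumes, following the usual R-squared cluster validity index, that the mean inside SSB is taken per cluster and per dimension, so that SSB plays the role of a within-cluster sum of squares; the function name `r_squared_index` and the toy points are hypothetical.

```python
def r_squared_index(points, labels):
    """(SST - SSB) / SST for points grouped by cluster labels."""
    dims = range(len(points[0]))

    def sum_sq(pts):
        # Sum of squared deviations from the per-dimension mean of pts.
        total = 0.0
        for j in dims:
            mean_j = sum(p[j] for p in pts) / len(pts)
            total += sum((p[j] - mean_j) ** 2 for p in pts)
        return total

    sst = sum_sq(points)  # all objects together
    if sst == 0:
        return 0.0        # all points identical: no cluster structure
    # Assumption: SSB uses each cluster's own mean (within-cluster scatter).
    ssb = sum(
        sum_sq([p for p, l in zip(points, labels) if l == t])
        for t in set(labels)
    )
    return (sst - ssb) / sst


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_separated = r_squared_index(points, [0, 0, 1, 1])
poorly_separated = r_squared_index(points, [0, 1, 0, 1])
print(well_separated, poorly_separated)
```

Grouping the two left-hand points against the two right-hand points gives a value close to 1, while the alternative grouping gives a value close to 0, matching the interpretation of the index in the text.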

Constructing the consensus embedding 

Given K selected embeddings $\mathbb{R}^n_k$, $k \in \{1, \ldots, K\}$, we
quantify pairwise relationships between all the objects in each
$\mathbb{R}^n_k$ via Euclidean pairwise distances. Euclidean distances were
chosen for our implementation as they are well understood, satisfy the
metric assumption of the pairwise relationship, and are directly usable
within the other methods used in this work. $\Omega_k$ denotes the ML
estimator used for calculating $\tilde{\delta}^{cd}$ from K observations
$\delta^{cd}_k$ for all objects $c, d \in C$.
Algorithm CalcConsEmbed

Input: $X_k(c) \in \mathbb{R}^n_k$ for all objects $c \in C$, $k \in \{1, \ldots, K\}$
Output: $\tilde{X}(c) \in \tilde{\mathbb{R}}^n$
Data Structures: Confusion matrix W, ML estimator $\Omega$, projection
method $\gamma$

begin
0. for k = 1 to K do
1.     Calculate $W_k(i, j) = \|X_k(c) - X_k(d)\|_2$ for all objects
       $c, d \in C$ with indices i, j;
2. endfor
3. Apply normalization to all $W_k$, $k \in \{1, \ldots, K\}$;
4. Obtain $\tilde{W}(i, j) = \Omega_k[W_k(i, j)]$ $\forall c, d \in C$;
5. Apply projection method $\gamma$ to $\tilde{W}$ to obtain final
   consensus embedding $\tilde{\mathbb{R}}^n$;
end

Corresponding entries across all $W_k$ (after any necessary normalization)
are used to estimate $\tilde{\delta}^{cd}$ (and stored in $\tilde{W}$). In
our implementation, we have used the median as the ML estimator as (1) the
median is less corruptible to outliers, and (2) the median and the
expectation are interchangeable if one assumes a normal distribution [41].
In Section 3 we compare classification results using both the mean and
median individually as the ML estimator. We apply a projection method
$\gamma$, such as multi-dimensional scaling (MDS) [21], to the resulting
$\tilde{W}$ to embed the objects in $\tilde{\mathbb{R}}^n$ while preserving
the pairwise distances between all objects $c \in C$. The underlying
intuition for this final step is based on a similar approach adopted in
[15] where MDS was applied to the co-association matrix (obtained by
accumulating multiple weak clusterings of the data) in order to visualize
the clustering results. As $\tilde{W}$ is analogous to the co-association
matrix, the projection method $\gamma$ will allow us to construct the
consensus embedding space $\tilde{\mathbb{R}}^n$.
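
Steps 0 to 4 of CalcConsEmbed can be sketched with the Python standard library alone. Two points are assumptions rather than details given in the text: the normalization divides each distance matrix by its largest entry, and the final projection (for example MDS) is left out to keep the sketch dependency-free. The function names are hypothetical.

```python
import math
from statistics import median


def pairwise_distances(emb):
    """W_k(i, j): Euclidean distance between objects i and j."""
    n = len(emb)
    return [[math.dist(emb[i], emb[j]) for j in range(n)] for i in range(n)]


def normalize(w):
    """Assumed normalization: scale so the largest distance is 1."""
    peak = max(max(row) for row in w)
    return [[v / peak for v in row] for row in w] if peak > 0 else w


def consensus_distances(embeddings, estimator=median):
    """Entry-wise estimate over the K normalized distance matrices."""
    mats = [normalize(pairwise_distances(e)) for e in embeddings]
    n = len(mats[0])
    return [
        [estimator([w[i][j] for w in mats]) for j in range(n)]
        for i in range(n)
    ]


# Three made-up 1-D embeddings of three objects; the third one is distorted.
embs = [
    [(0.0,), (1.0,), (2.0,)],
    [(0.0,), (2.0,), (4.0,)],
    [(0.0,), (5.0,), (5.5,)],
]
w_tilde = consensus_distances(embs)
print(w_tilde[0][1], w_tilde[0][2], w_tilde[1][2])
```

With the median as the estimator, the distorted third embedding does not change the consensus distances, which illustrates the robustness argument made for the median above. Passing `statistics.mean` as `estimator` gives the mean-based variant.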

One can hypothesize that $\tilde{W}$ is an approximation of distances
calculated in the original feature space. Distances in the original
feature space can be denoted as
$\hat{W}(i, j) = \|F(c) - F(d)\|_2$ $\forall c, d \in C$ with indices i, j.
An alternative approach could therefore be to calculate $\hat{W}$
in the original feature space and apply $\gamma$ to it instead.
However, noise artifacts in the original feature space 
may prevent it from being truly optimal for analysis 
[11]. As we will demonstrate in Section 3, simple DR, as 
well as consensus DR, provide superior representations 
of the data (by accounting for noise artifacts) as com- 
pared to using the original feature space directly. 

Computational efficiency of Consensus Embedding

The most computationally expensive operations in consensus embedding are
(1) calculation of multiple uncorrelated embeddings (solved as an
eigenvalue problem in $O(n^3)$ time for n objects), and (2) computation of
pairwise distances between all the objects in each strong embedding space
(computed in time $O(n^2)$ for n objects). A slight reduction in both time
and memory complexity can be achieved based on the fact that distance
matrices will be symmetric (hence only the upper triangular part need be
calculated). Additionally, multiple embeddings and distance matrices can be
computed via code parallelization. However, these operations still scale
polynomially with the number of objects n.

To further reduce the computational burden we embed the consensus embedding
paradigm within an intelligent sub-sampling framework. We make use of a
fast implementation [42] of the popular mean shift algorithm [43] (MS) to
iteratively represent data objects via their most representative cluster
centers. As a result, the space retains its original dimensionality, but
now comprises only some fractional number (n/t) of the original objects.
These n/t objects are used in the calculations of consensus embedding as
well as for any additional analysis. A mapping (Map) is retained from all n
original objects to the final n/t representative objects. We can therefore
map back results and analyses from the lowest resolution (n/t objects) to
the highest resolution (n objects) easily. The smaller number of objects
($n/t \ll n$) ensures that consensus embedding is computationally feasible.
In our implementation, t was determined automatically based on the number
of stable cluster centers detected by MS.

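Algorithm ConsEmbedMS

Input: $F(c) \in \mathbb{R}^N$ for all objects $c \in C$, n
Output: $\tilde{X}(c) \in \tilde{\mathbb{R}}^n$
Data Structures: Reduced set of objects $\bar{c} \in \bar{C}$

begin
0. Apply MS [42] to $\mathbb{R}^N$ resulting in $\bar{\mathbb{R}}^N$ for
   sub-sampled set of objects $\bar{c} \in \bar{C}$;
1. Save Map from sub-sampled set of objects $\bar{c} \in \bar{C}$ to
   original set of objects $c \in C$;
2. $X_m(\bar{c}) = CreateEmbed(F(\bar{c}) \mid \eta_m, \Phi, M, V)$, $\forall m \in \{1, \ldots, M\}$;
3. $X_k(\bar{c}) = SelEmbed(X_m(\bar{c}) \mid Q, \psi, \theta)$, $\forall k \in \{1, \ldots, K\}$, $\forall m \in \{1, \ldots, M\}$;
4. $\tilde{X}(\bar{c}) = CalcConsEmbed(X_k(\bar{c}) \mid W, \Omega, \gamma)$, $\forall k \in \{1, \ldots, K\}$;
5. Use MS and Map to calculate $\tilde{X}(c) \in \tilde{\mathbb{R}}^n$ from
   $\tilde{X}(\bar{c}) \in \tilde{\mathbb{R}}^n$ for all objects $c \in C$;
end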
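
The role of Map in steps 1 and 5 can be illustrated with a short sketch. Mean shift itself is not implemented here: the representative centers are assumed to be given, and assigning each object to its nearest center is a simplified stand-in for the mapping that mean shift would produce. All names and values are hypothetical.

```python
import math


def build_map(objects, centers):
    """For each original object, index of its nearest representative."""
    return [
        min(range(len(centers)), key=lambda r: math.dist(x, centers[r]))
        for x in objects
    ]


def map_back(rep_values, mapping):
    """Propagate a result computed on representatives to all objects."""
    return [rep_values[r] for r in mapping]


# Four made-up objects summarized by two representative centers.
objects = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
centers = [(0.1, 0.05), (5.05, 4.95)]
mapping = build_map(objects, centers)

# Pretend consensus embedding coordinates computed on the two centers only.
rep_embedding = [(-1.0,), (1.0,)]
full_embedding = map_back(rep_embedding, mapping)
print(mapping, full_embedding)
```

The expensive steps (CreateEmbed, SelEmbed, CalcConsEmbed) then only ever see the representatives, and the lookup in `map_back` restores a result for every original object.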

For an MRI image comprising 5589 pixels (objects) for 
analysis, the individual algorithms CreateEmbed, 
SelEmbed and CalcConsEmbed took 121.33, 12.22, and 
35.75 seconds respectively to complete (on average). By 
implementing our mean-shift optimization it took only 
119 seconds (on average) for ConsEmbedMS to com- 
plete analysis of an MRI image comprising between 
15,000 and 40,000 pixels (objects); a calculation that 
would have been computationally intractable otherwise. 
All experiments were conducted using MATLAB 7.10 
(Mathworks, Inc.) on a 72 GB RAM, 2 quad core 2.33 
GHz 64-bit Intel Core 2 processor machine. 

Experimental Design for Evaluating Consensus 
Embedding 

Dataset description 

The different datasets used in this work included: (1) syn- 
thetic brain image data, (2) clinical prostate image data, 
and (3) gene-expression data (comprehensively summar- 
ized in Table 2). The overarching goal in each experiment 
described was to determine the degree of improvement in 
class-based separation via the consensus embedding repre- 
sentation as compared to alternative representations 
(quantified in terms of classification accuracy). Note that in the case of
prostate images as well as gene-expression data we have tested the
robustness of the consensus embedding framework via the use of independent
training and testing sets.

In the case of image data (brain, prostate MRI), we have derived texture
features [44] on a per-pixel basis from each image. These features are
based on calculating statistics from a gray level intensity co-occurrence
matrix constructed from the image, and were chosen due to previously
demonstrated discriminability between cancerous and non-cancerous regions
in the prostate [45] and different types of brain matter [46] for MRI data.
Following feature extraction, each pixel c in the MR image is associated
with an N-dimensional feature vector
$F(c) = [f_u(c) \mid u \in \{1, \ldots, N\}] \in \mathbb{R}^N$, where
$f_u(c)$ is the response to a feature operator for pixel c. In the case of
gene-expression data, every sample c is considered to be associated with a
high-dimensional gene-expression vector, also denoted
$F(c) \in \mathbb{R}^N$.

DR methods utilized to reduce $\mathbb{R}^N$ to $\mathbb{R}^n$ were graph
embedding (GE) [6] and PCA [2]. These methods were chosen in order to
demonstrate instantiations of consensus embedding using representative
linear and non-linear DR schemes. Additionally, these methods have been
leveraged both for segmentation as well as classification of similar
biomedical image and bioinformatics datasets in previous work [47,48]. The
dimensionality of the embedding space, n, is calculated as the intrinsic
dimensionality of $\mathbb{R}^N$ via the method of [38]. To remain
consistent with notation defined previously, the result of DR on
$F(c) \in \mathbb{R}^N$ is denoted $X_\Phi(c) \in \mathbb{R}^n$, while the
result of consensus DR will be denoted
$\tilde{X}_\Phi(c) \in \tilde{\mathbb{R}}^n$. The subscript $\Phi$
corresponds to the DR method used, $\Phi \in \{GE, PCA\}$. For ease of
description, the corresponding classification results are denoted
$\Psi(F)$, $\Psi(X_\Phi)$, $\Psi(\tilde{X}_\Phi)$, respectively.

Experiment 1: Synthetic MNI brain data 

Synthetic brain data [49] was acquired from BrainWeb, consisting of
simulated proton density (PD) MRI brain volumes at various noise and bias
field inhomogeneity levels. Gaussian noise artifacts have been added to
each pixel in the image, while inhomogeneity artifacts were added via
pixel-wise multiplication of the image with an intensity non-uniformity
field. Corresponding labels for each of the separate regions within the
brain, including white matter (WM) and grey matter (GM), were also
available. Images comprising WM and GM alone were obtained from 10 sample
slices (ignoring other brain tissue classes). The objective was to
successfully partition GM and WM regions on these images across all 18
combinations of noise and inhomogeneity, via pixel-level classification (an
application similar to Figure 1). Classification is done for all pixels
$c \in C$ based on each of,

(i) the high-dimensional feature space $F(c) \in \mathbb{R}^N$, N = 14,

(ii) simple GE on F(c), denoted $X_{GE}(c) \in \mathbb{R}^n$, n = 3,

(iii) multi-dimensional scaling (MDS) on distances calculated directly in
$\mathbb{R}^N$, denoted as $X_{MDS}(c) \in \mathbb{R}^n$, n = 3
(alternative to consensus embedding, explained in Section 2),

(iv) consensus embedding, denoted
$\tilde{X}_{GE}(c) \in \tilde{\mathbb{R}}^n$, n = 3.

The final slice classification results obtained for each of these spaces
are denoted as $\Psi(F)$, $\Psi(X_{GE})$, $\Psi(X_{MDS})$,
$\Psi(\tilde{X}_{GE})$, respectively.

Table 2 Datasets

Datasets                     Description                                               Features
---------------------------  --------------------------------------------------------  ----------------------------------
Synthetic brain MRI images   10 slices (109 x 131 comprising 5589 pixels), 6 noise     Haralick (14)
                             levels (0%, 1%, 3%, 5%, 7%, 9%), 3 RF inhomogeneity
                             levels (0%, 20%, 40%)
Prostate MRI images          16 slices, 2 datasets (256 x 256 comprising               Haralick, 1st order
                             15,000-40,000 pixels)                                     statistical (38)
Gene-expression data:                                                                  300 most class-informative genes
  Prostate Tumor             102 training, 34 testing, 12,600 genes
  Breast Cancer Relapse      78 training, 19 testing, 24,481 genes
  Lymphoma                   38 training, 34 testing, 7130 genes
  Lung Cancer                32 training, 149 testing, 12,533 genes

Image and gene-expression datasets used in our experiments.

Experiment 2: Comparison of ML estimators in consensus 
embedding 

For the synthetic brain data [49], over all 18 combinations of noise and
inhomogeneity and over all 10 images, we compare the use of mean and median
as ML estimators in CalcConsEmbed. This is done by preserving outputs from
SelEmbed and only changing the ML estimator in CalcConsEmbed. We then
compare classification accuracies for detection of white matter in each of
the resulting consensus embedding representations, $\tilde{X}^{Med}_{GE}$
and $\tilde{X}^{Mean}_{GE}$ (superscript denotes choice of ML estimator).

Experiment 3: Clinical prostate MRI data 

Two different prostates were imaged ex vivo using a 4 Tesla MRI scanner
following surgical resection. The excised glands were then sectioned into
2D histological slices which were digitized using a whole slide scanner.
Regions of cancer were determined via Haematoxylin and Eosin (H&E) staining
of the histology sections. The cancer areas were then mapped onto
corresponding MRI sections via a deformable registration scheme [50].
Additional details of data acquisition are described in [45].

For this experiment, a total of 16 4 Tesla ex vivo T2-weighted MRI and
corresponding digitized histology images were considered. The purpose of
this experiment was to accurately identify cancerous regions on prostate
MRI data via pixel-level classification, based on exploiting textural
differences between diseased and normal regions on T2-weighted MRI [45].
For each MRI image, M embeddings, $\mathbb{R}^n_m$,
$m \in \{1, \ldots, M\}$, were first computed (via CreateEmbed) along with
their corresponding embedding strengths $\psi(\mathbb{R}^n_m)$ (based on
clustering classification accuracy). Construction of the consensus
embedding was performed via a supervised cross-validation framework, which
utilized independent training and testing sets for selection of strong
embeddings (SelEmbed). The algorithm proceeds as follows,

(a) Training ($S^{tr}$) and testing ($S^{te}$) sets of the data (MRI
images) were created.

(b) For each element (image) of $S^{tr}$, strong embeddings were identified
based on $\theta = 0.15 \times \max_M [\psi(\mathbb{R}^n_m)]$.

(c) Those embeddings voted as being strong across all the elements (images)
in $S^{tr}$ were then identified and selected.

(d) For the data (images) in $S^{te}$, corresponding embeddings were then
combined (via CalcConsEmbed) to yield the final consensus embedding result.
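
Steps (b) and (c) can be sketched as follows. The sketch assumes that "voted as being strong across all the elements" means an embedding index must pass the threshold on every training image, which is one possible reading of the text; the function names and strength values are hypothetical.

```python
def strong_indices(strengths, fraction=0.15):
    """Indices m with strength above theta = fraction * max strength."""
    theta = fraction * max(strengths)
    return {m for m, s in enumerate(strengths) if s > theta}


def voted_strong(strengths_per_image, fraction=0.15):
    """Assumed voting rule: strong on every training image."""
    per_image = [strong_indices(s, fraction) for s in strengths_per_image]
    return set.intersection(*per_image)


# Made-up strengths of M = 3 base embeddings on two training images.
train_strengths = [
    [0.90, 0.10, 0.50],  # theta = 0.135: embeddings 0 and 2 pass
    [0.80, 0.60, 0.05],  # theta = 0.120: embeddings 0 and 1 pass
]
selected = voted_strong(train_strengths)
print(selected)
```

Only the embeddings selected on the training images are then combined for the held-out test image, so no test labels are used during selection.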

A leave-one-out cross-validation strategy was employed in this experiment.
A comparison is made between the pixel-level classifications for (1) simple
GE denoted as $\Psi(X_{GE})$, and (2) consensus GE denoted as
$\Psi(\tilde{X}_{GE})$.

Experiment 4: Gene-expression data 

Four publicly available binary class gene-expression datasets were obtained
with corresponding class labels for each sample [4]; the purpose of the
experiment being to differentiate the two classes in each dataset. This
data comprises the gene-expression vectorial data profiles of normal and
cancerous samples for each disease listed in Table 2, where the total
number of samples ranges from 72 to 181 patients and the number of
corresponding features ranges from 7130 to 24,481 genes or peptides. All 4
data sets comprise independent training ($S^{tr}$) and testing ($S^{te}$)
subsets, and these were utilized within a supervised framework for
constructing and evaluating the consensus embedding representation.

Prior to analysis, each dataset was first pruned to the 
300 most class-informative features based on t-statistics 
as described in [51]. The supervised cross-validation 
methodology for constructing the consensus embedding 
using independent training and testing sets is as follows, 

(a) First, CreateEmbed is run concurrently on data in $S^{tr}$ and
$S^{te}$, such that the same subsets of features are utilized when
generating base embeddings for each of $S^{tr}$ and $S^{te}$.

(b) SelEmbed is then executed on base embeddings generated from $S^{tr}$
alone, thus selecting strong embeddings from amongst those generated.
Strong embeddings were defined based on
$\theta = 0.15 \times \max_M [\psi(\mathbb{R}^n_m)]$.

(c) Corresponding (selected) embeddings for data in $S^{te}$ are then
combined within CalcConsEmbed to obtain the final consensus embedding
vectors denoted as $\tilde{X}_\Phi(c) \in \tilde{\mathbb{R}}^n$,
$\Phi \in \{GE, PCA\}$, n = 4.

For this dataset, both supervised (via clustering classi- 
fication accuracy, superscript S) and unsupervised (via 
RSI, superscript US) measures of embedding strength 
were evaluated in terms of the classification accuracy of 
the corresponding consensus embedding 
representations. 

As a comparative DR strategy, a semi-supervised variant of GE [52] (termed
SSAGE) was implemented, which utilizes label information when constructing
the embedding. Within this scheme, higher weights are given to within-class
points and lower weights to points from different classes. When running
SSAGE, both $S^{tr}$ and $S^{te}$ were combined into a single cohort of
data, and labels corresponding to $S^{tr}$ alone were revealed to the SSAGE
algorithm.

An additional comparison was conducted against a supervised random
forest-based kNN classifier operating in the original feature space to
determine whether DR provided any advantages in the context of
high-dimensional biomedical data. This was implemented by training a kNN
classifier on each of the feature subsets for $S^{tr}$ (that were utilized
in CreateEmbed), but without performing DR on the data. Each such kNN
classifier was then used to classify corresponding data in $S^{te}$. The
final classification result for each sample in $S^{te}$ is based on
ensemble averaging to calculate the probability of a sample belonging to
the target class. Classifications compared in this experiment were
$\Psi(F)$, $\Psi(X_{SSGE})$, $\Psi(\tilde{X}^{S}_{PCA})$,
$\Psi(\tilde{X}^{S}_{GE})$, $\Psi(\tilde{X}^{US}_{PCA})$,
$\Psi(\tilde{X}^{US}_{GE})$, respectively.
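
The feature-subset kNN ensemble used as the original-feature-space baseline can be illustrated with a small sketch. The value of k, the feature subsets, and the toy data are all hypothetical, and the fraction of per-subset votes for the target class is used as the ensemble-averaged probability, which is an assumption about how the averaging was done.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, x, k=3):
    """Majority label among the k nearest training samples."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def project(row, cols):
    """Restrict a sample to one feature subset."""
    return [row[c] for c in cols]


def ensemble_probability(train_x, train_y, x, subsets, target=1, k=3):
    """Fraction of per-subset kNN classifiers voting for the target class."""
    votes = 0
    for cols in subsets:
        sub_train = [project(r, cols) for r in train_x]
        votes += knn_predict(sub_train, train_y, project(x, cols), k) == target
    return votes / len(subsets)


# Made-up training data with three features and binary labels.
train_x = [(0, 0, 9), (0, 1, 1), (1, 0, 5), (5, 5, 2), (5, 6, 8), (6, 5, 4)]
train_y = [0, 0, 0, 1, 1, 1]
subsets = [(0, 1), (0, 2), (1, 2)]  # hypothetical feature subsets

p_target = ensemble_probability(train_x, train_y, (5.5, 5.5, 5), subsets)
p_other = ensemble_probability(train_x, train_y, (0, 0, 0), subsets)
print(p_target, p_other)
```

Because every classifier works directly on raw feature subsets, this baseline isolates the effect of skipping dimensionality reduction while keeping the same ensemble structure.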

Classification 

For image data (brain, prostate MRI), classification was done via
replicated k-means clustering [15], while for gene-expression data,
classification was done via hierarchical clustering [53]. The choice of
clustering algorithm was made based on the type of data being considered in
each of the different experiments, as well as previous work in the field.
Note that neither of these clustering techniques considers class label
information while classifying the data, and both have been demonstrated as
being deterministic in nature (hence ensuring reproducible results). The
motivation in using such techniques for classification was to ensure that
no classifier bias or fitting optimization was introduced during
evaluation. As our experimental intent was purely to examine improvements
in class separation offered by the different data representations, all
improvements in corresponding classification accuracies may be directly
attributed to improved class discriminability in the corresponding space
being evaluated (without being dependent on optimizing the technique used
for classification).

Evaluating and visualizing results 

To visualize classification results as region partitions on 
the images (brain, prostate MRI), all the pixels were 
plotted back onto the image and assigned colors based 
on their classification label membership. Similar to the 
partitioning results shown in Figure 1, pixels of the 
same color were considered to form specific regions. 
For example, in Figure 1(h), pixels colored green were 
considered to form the foreground region, while pixels 
colored red were considered to form the background. 

Classification accuracy of clustering results for images 
as well as gene-expression data can be quantitatively 
evaluated as described previously (Section 2). Image 
region partitioning results as well as corresponding clas- 
sification accuracies of the different methods (GE, PCA, 
consensus embedding) were used to determine what 
improvements are offered by consensus embedding. 

Results and Discussion 

Experiment 1: Synthetic MNI Brain data 

Figure 2 shows qualitative pixel-level WM detection results on MNI brain
data for comparisons to be made across 3 different noise and inhomogeneity
combinations (out of 18 possible combinations). The original PD MRI image
for selected combinations of noise and inhomogeneity with the ground truth
for WM superposed as a red contour is shown in Figures 2(a), (f), (k). Note
that this is a 2 class problem, and GM (red) and WM (green) region
partitions are visualized together in all the result images, as explained
previously. Other brain tissue classes were ignored. Comparing the
different methods used, when only noise (1%) is added to the data, all
three of $\Psi(F)$ (Figure 2(b)), $\Psi(X_{MDS})$ (Figure 2(c)), and
$\Psi(X_{GE})$ (Figure 2(d)) are only able to identify the outer boundary
of the WM region. However, $\Psi(\tilde{X}_{GE})$ (Figure 2(e)) shows more
accurate detail of the WM region in the image (compare with the ground
truth WM region outlined in red in Figure 2(a)). When RF inhomogeneity
(20%) is added to the data for intermediate levels of noise (3%), note the
poor WM detection results for $\Psi(F)$ (Figure 2(g)), $\Psi(X_{MDS})$
(Figure 2(h)), and $\Psi(X_{GE})$ (Figure 2(i)). $\Psi(\tilde{X}_{GE})$
(Figure 2(j)), however, yields a more accurate WM detection result
(compared to the ground truth WM region in Figure 2(f)). Increasing the
levels of noise (7%) and inhomogeneity (40%) results in further degradation
of WM detection performance for $\Psi(F)$ (Figure 2(l)), $\Psi(X_{MDS})$
(Figure 2(m)), and $\Psi(X_{GE})$ (Figure 2(n)). Note from Figure 2(o) that
$\Psi(\tilde{X}_{GE})$ appears to fare far better than $\Psi(F)$,
$\Psi(X_{MDS})$, and $\Psi(X_{GE})$.

Figure 2 WM detection results on synthetic BrainWeb data. Pixel-level WM
detection results visualized for one image from the MNI brain MRI dataset,
each row corresponding to a different combination of noise and
inhomogeneity: (a)-(e) 1% noise, 0% inhomogeneity, (f)-(j) 3% noise, 20%
inhomogeneity, (k)-(o) 7% noise, 40% inhomogeneity. The first column shows
the original PD MRI image with the ground truth for WM outlined in red,
while the second, third, fourth, and fifth columns show the pixel-level WM
classification results for $\Psi(F)$, $\Psi(X_{MDS})$, $\Psi(X_{GE})$, and
$\Psi(\tilde{X}_{GE})$, respectively. The red and green colors in (b)-(e),
(g)-(j), (l)-(o) denote the GM and WM regions identified in each result
image.

For each of the 18 combinations of noise and inhomogeneity, we averaged the
WM detection accuracies $\phi^{Acc}(F)$, $\phi^{Acc}(X_{MDS})$,
$\phi^{Acc}(X_{GE})$, $\phi^{Acc}(\tilde{X}_{GE})$ (calculated as described
in Section 2) over all 10 images considered (a total of 180 experiments).
These results are summarized in Table 3 (corresponding trend visualization
in Figure 3) with accompanying standard deviations in accuracy. Note that
$\phi^{Acc}(\tilde{X}_{GE})$ shows a consistently better performance than
the remaining methods ($\phi^{Acc}(F)$, $\phi^{Acc}(X_{MDS})$,
$\phi^{Acc}(X_{GE})$) in 17 out of 18 combinations of noise and
inhomogeneity. This trend is also visible in Figure 3.

Table 3 WM detection results for synthetic BrainWeb data

Noise  Inhomogeneity  $\phi^{Acc}(F)$  $\phi^{Acc}(X_{MDS})$  $\phi^{Acc}(X_{GE})$  $\phi^{Acc}(\tilde{X}_{GE})$
-----  -------------  ---------------  ---------------------  --------------------  ----------------------------
0%     0%             65.55 ± 1.84     65.55 ± 1.84           65.55 ± 1.84          66.86 ± 2.89
0%     20%            55.75 ± 1.65     55.75 ± 1.65           55.75 ± 1.65          61.65 ± 4.58
0%     40%            70.03 ± 2.79     70.08 ± 2.82           51.84 ± 0.99          64.28 ± 5.93
1%     0%             59.78 ± 1.31     59.74 ± 1.29           74.71 ± 9.06          80.62 ± 1.03
1%     20%            59.36 ± 1.30     59.32 ± 1.33           60.95 ± 8.67          73.07 ± 8.97
1%     40%            59.20 ± 1.12     59.12 ± 1.15           56.38 ± 1.53          66.46 ± 9.80
3%     0%             53.35 ± 1.31     53.39 ± 1.27           59.94 ± 7.00          85.38 ± 0.75
3%     20%            55.01 ± 2.92     54.91 ± 3.11           63.88 ± 10.85         84.61 ± 0.81
3%     40%            57.63 ± 1.78     57.71 ± 1.67           57.33 ± 1.38          79.19 ± 7.56
5%     0%             62.90 ± 0.72     62.84 ± 0.66           66.67 ± 10.22         89.68 ± 1.36
5%     20%            61.49 ± 1.38     61.49 ± 1.42           82.61 ± 7.39          86.81 ± 1.38
5%     40%            61.02 ± 0.99     61.03 ± 1.09           74.91 ± 9.09          81.67 ± 1.51
7%     0%             64.28 ± 0.71     64.26 ± 0.76           66.95 ± 6.25          87.81 ± 0.73
7%     20%            64.07 ± 1.03     64.01 ± 0.96           74.22 ± 10.59         86.07 ± 1.05
7%     40%            64.05 ± 1.19     64.04 ± 1.14           64.44 ± 1.25          81.53 ± 1.57
9%     0%             64.96 ± 0.90     64.94 ± 0.88           66.36 ± 1.66          75.51 ± 14.35
9%     20%            64.85 ± 0.97     64.79 ± 0.95           65.68 ± 1.32          78.18 ± 9.86
9%     40%            64.65 ± 0.83     64.63 ± 0.84           65.30 ± 0.74          77.83 ± 5.00

Pixel-level WM detection accuracy and standard error averaged over 10 MNI
brain images and across 18 combinations of noise and inhomogeneity for each
of: (1) $\Psi(F)$, (2) $\Psi(X_{MDS})$, (3) $\Psi(X_{GE})$,
(4) $\Psi(\tilde{X}_{GE})$ (with median as MLE). Improvements in
classification accuracy via $\Psi(\tilde{X}_{GE})$ were found to be
statistically significant.

For each combination of noise and inhomogeneity, a paired Student's t-test
was conducted between $\phi^{Acc}(\tilde{X}_{GE})$ and each of
$\phi^{Acc}(F)$, $\phi^{Acc}(X_{MDS})$, and $\phi^{Acc}(X_{GE})$, with the
null hypothesis being that there was no improvement via
$\Psi(\tilde{X}_{GE})$ over all 10 brain images considered.
$\Psi(\tilde{X}_{GE})$ was found to perform significantly better
(p < 0.05) than all of $\Psi(F)$, $\Psi(X_{MDS})$, and $\Psi(X_{GE})$ in 16
out of 18 combinations of noise and inhomogeneity.
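
The paired test described above can be sketched with the standard library. The per-image accuracies are made-up illustrative numbers, not values from the paper, and treating the test as one-sided (critical value 1.833 for 9 degrees of freedom at the 0.05 level) is an assumption, since the text only states the null hypothesis of no improvement.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical accuracies (%) on 10 images for one noise/inhomogeneity pair.
consensus = [85.1, 84.7, 86.0, 85.5, 84.9, 85.8, 86.2, 84.4, 85.3, 85.9]
simple_ge = [60.2, 58.9, 71.5, 55.0, 62.3, 59.8, 66.1, 57.4, 61.0, 63.7]

t_stat = paired_t_statistic(consensus, simple_ge)
T_CRIT_ONE_SIDED_DF9 = 1.833  # t distribution, 9 d.o.f., alpha = 0.05
print(t_stat > T_CRIT_ONE_SIDED_DF9)
```

Pairing by image removes image-to-image variation from the comparison, so the test only looks at how much each image gains from the consensus representation.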



Comparing $\phi^{Acc}(F)$, $\phi^{Acc}(X_{MDS})$, and $\phi^{Acc}(X_{GE})$,
it can be observed that $\Psi(F)$ and $\Psi(X_{MDS})$ perform similarly for
all combinations of noise and inhomogeneity (note that the corresponding
red and blue trend-lines completely overlap in Figure 3). In contrast,
$\Psi(X_{GE})$ shows improved performance at most combinations of noise and
inhomogeneity as compared to either of $\Psi(F)$ and $\Psi(X_{MDS})$.
$\Psi(\tilde{X}_{GE})$ was seen to significantly improve over all of
$\Psi(F)$, $\Psi(X_{MDS})$, and $\Psi(X_{GE})$, reflecting the advantages
of consensus embedding.

Figure 3 Trends in WM detection accuracy across Experiments 1 and 2.
Visualization of classification accuracy trends (Tables 3 and 4).
$\Psi(\tilde{X}_{GE})$ (consensus embedding) performs significantly better
than comparative strategies (original feature space, GE, MDS); using median
as ML estimator (purple) may be marginally more consistent than using mean
as ML estimator (orange). $\Psi(F)$ (blue) and $\Psi(X_{MDS})$ (red)
perform similarly (corresponding trends directly superposed on one
another).

Experiment 2: Comparison of ML estimators 

WM pixel-level detection accuracy results for consensus embedding using two
different ML estimators (median and mean) were averaged over all 10 MNI
brain images considered and summarized in Table 4, for each of the 18
combinations of noise and inhomogeneity (total of 180 experiments). We see
that the accuracy values are generally consistent across all the
experiments conducted. No statistically significant difference in
classifier performance was observed between $\Psi(\tilde{X}^{Med}_{GE})$
and $\Psi(\tilde{X}^{Mean}_{GE})$. It would appear that
$\Psi(\tilde{X}^{Med}_{GE})$ is less susceptible to higher noise and bias
field levels compared to $\Psi(\tilde{X}^{Mean}_{GE})$ (trends in
Figure 3).

Table 4 Comparing the mean and median as ML estimators within CalcConsEmbed

Noise  Inhomogeneity  $\phi^{Acc}(\tilde{X}^{Med}_{GE})$  $\phi^{Acc}(\tilde{X}^{Mean}_{GE})$
-----  -------------  ----------------------------------  -----------------------------------
0%     0%             66.86 ± 2.89                        66.89 ± 2.91
0%     20%            61.65 ± 4.58                        65.34 ± 4.12
0%     40%            64.28 ± 5.93                        63.39 ± 6.51
1%     0%             80.62 ± 1.03                        80.45 ± 1.07
1%     20%            73.07 ± 8.97                        77.81 ± 0.96
1%     40%            66.46 ± 9.80                        70.56 ± 7.15
3%     0%             85.38 ± 0.75                        85.53 ± 0.84
3%     20%            84.61 ± 0.81                        84.49 ± 0.76
3%     40%            79.19 ± 7.56                        81.37 ± 1.39
5%     0%             89.68 ± 1.36                        90.85 ± 1.32
5%     20%            86.81 ± 1.38                        87.01 ± 1.83
5%     40%            81.67 ± 1.51                        81.82 ± 1.32
7%     0%             87.81 ± 0.73                        86.17 ± 6.11
7%     20%            86.07 ± 1.05                        82.73 ± 8.23
7%     40%            81.53 ± 1.57                        81.72 ± 1.47
9%     0%             75.51 ± 14.35                       74.32 ± 16.11
9%     20%            78.18 ± 9.86                        73.63 ± 12.75
9%     40%            78.18 ± 9.86                        73.63 ± 12.75

Pixel-level WM detection accuracy and standard error averaged over 10 MNI
brain images and for 18 combinations of noise and inhomogeneity (180
experiments) with each of the 2 ML estimators considered in CalcConsEmbed:
(1) median ($\Psi(\tilde{X}^{Med}_{GE})$), (2) mean
($\Psi(\tilde{X}^{Mean}_{GE})$).

Experiment 3: Clinical Prostate MRI data

Figure 4 shows qualitative results of the ConsEmbedMS algorithm in
detecting prostate cancer (CaP) on T2-weighted MRI, each row corresponding
to a different 2D MRI image. Comparing the pixel-level CaP detection
results (visualized in green) in Figures 4(c) and 4(g) to the green CaP
masks in Figures 4(b) and 4(f), obtained by registering the MRI images with
corresponding histology images [50] (not shown), reveals that
$\Psi(X_{GE})$ results in a large false positive error. In contrast,
$\Psi(\tilde{X}_{GE})$ (Figures 4(d) and 4(h)) appears to better identify
the CaP region when compared to the ground truth for CaP extent in Figures
4(b) and 4(f). Figure 5 illustrates the relative pixel-level prostate
cancer detection accuracies averaged across 16 MRI slices for the 2 methods
compared. $\Psi(\tilde{X}_{GE})$ was found to significantly (p < 0.05)
outperform $\Psi(X_{GE})$ in terms of accuracy and specificity of CaP
segmentation over all 16 slices considered.

Experiment 4: Classification of Gene-Expression Data 

Table 5 summarizes classification accuracies for each of the strategies
compared: supervised consensus-PCA and consensus-GE
($\Psi(\tilde{X}^{S}_{PCA})$, $\Psi(\tilde{X}^{S}_{GE})$, respectively),
unsupervised consensus-PCA and consensus-GE ($\Psi(\tilde{X}^{US}_{PCA})$,
$\Psi(\tilde{X}^{US}_{GE})$, respectively), SSAGE ($\Psi(X_{SSGE})$), as
well as supervised classification of the original feature space
($\Psi(F)$). These results suggest that consensus embedding yields a
superior classification accuracy compared to alternative strategies. We
posit that this improved performance is due to the more accurate
representation of the data obtained via consensus embedding.

The presence of a large noisy, high-dimensional space 
was seen to adversely affect supervised classification per- 
formance of F, which yielded a worse classification accu- 
racy than unsupervised classification (of consensus-GE 
and consensus-PCA) in 3 out of the 4 datasets. Moreover, 
semi-supervised DR, which utilized label information to 
construct $X_{SSGE}$, was also seen to perform worse than consensus
embedding (both supervised and unsupervised variants). We posit that this
is because SSAGE does not
explicitly account for noise, but only modifies the pairwise 
relationships between points based on label information 
(possibly exacerbating the effects of noise). By contrast, 
any label information used by consensus embedding is 
used to account for noisy samples when approximating 
the "true" pairwise relationships between points. The dif- 
ference in the final embedding representations can be 
visualized in 3D in Figure 6, obtained by plotting all the 
samples in the lung cancer gene-expression dataset in 3D 
Eigen space. Note that consensus DR (Figures 6(b)-(e)) 
shows significantly better separation between the classes 
with more distinct, tighter clusters as well as fewer false 
positives compared to SSAGE (Figure 6(a)). 

Further, comparing the performance of supervised ($\Psi(\tilde{X}^{S}_{PCA})$,
$\Psi(\tilde{X}^{S}_{GE})$) and unsupervised ($\Psi(\tilde{X}^{US}_{PCA})$,
$\Psi(\tilde{X}^{US}_{GE})$) variants of consensus embedding demonstrates
comparable performance between them, though a supervised measure of embedding
strength shows a trend towards
Figure 4 Prostate cancer detection results on ex vivo MRI data. (a), (e) 2D sections from 3D prostate MRI data, and (b), (f) corresponding
CaP masks superposed on the MRI, obtained via deformable registration with the corresponding histology slice (not shown) [50]. Corresponding
CaP detection results via (c), (g) $\Psi(X_{GE})$ (graph embedding), and (d), (h) $\Psi(\tilde{X}_{GE})$ (consensus embedding) are superposed back onto the original
MRI sections ((a), (e)). In each of (b)-(d) and (f)-(h), green denotes the CaP segmentation region. Note the significantly fewer false positives in (d)
and (h) compared to (c) and (g) respectively.



being more consistent. The relatively high performance of
$\Psi(\tilde{X}^{US}_{PCA})$ and $\Psi(\tilde{X}^{US}_{GE})$ demonstrates the
feasibility of a completely unsupervised framework for consensus embedding.

For both consensus PCA and consensus GE we tested the parameter sensitivity of
our scheme by varying the number of feature subsets generated
($M \in \{200, 500, 1000\}$) in the CreateEmbed algorithm (Tables 6 and 7).
The relatively low variance in classification accuracy as a function of $M$
reflects the invariance of consensus embedding to its parameters. No
consistent trend was seen in terms of either consensus-GE or consensus-PCA
outperforming the other.
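
As a rough illustration of the parameter varied here, the sketch below draws $M$ random feature subsets of a fixed size. The function name `feature_subsets`, the subset size, and uniform sampling without replacement are assumptions made for this example only; the text states only that CreateEmbed generates $M$ feature subsets, each of which would then be passed to a DR method such as PCA or GE (not shown).

```python
import random


def feature_subsets(n_features, n_subsets, subset_size, seed=0):
    """Draw n_subsets random feature-index subsets (hypothetical helper).

    Each subset holds subset_size distinct indices in [0, n_features).
    Uniform sampling without replacement is an assumption of this sketch.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    return [
        sorted(rng.sample(range(n_features), subset_size))
        for _ in range(n_subsets)
    ]


# Illustrative sizes only: M subsets of 10 features out of 50.
for M in (200, 500, 1000):
    subsets = feature_subsets(n_features=50, n_subsets=M, subset_size=10)
    print(M, len(subsets), subsets[0])
```

Varying `n_subsets` while holding everything else fixed mirrors the sensitivity experiment described above, under the stated assumptions.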



[Bar chart: CaP Classification Accuracy; vertical axis from 40.00% to 100.00%; bar labels 72.29% and 58.54%; legend: Graph Embedding, Consensus Embedding]

Figure 5 Prostate cancer detection accuracy on ex vivo MRI data. Pixel-level classification accuracy in identifying prostate cancer on T2-
weighted MRI, averaged over 16 2D MRI slices for $\Psi(X_{GE})$ (blue) and $\Psi(\tilde{X}_{GE})$ (red).
Table 5 Classification accuracies for different representation strategies for gene-expression data

| Dataset | $\phi^{Acc}(F)$ | $\phi^{Acc}(X_{SSGE})$ | $\phi^{Acc}(\tilde{X}^{S}_{PCA})$ | $\phi^{Acc}(\tilde{X}^{US}_{PCA})$ | $\phi^{Acc}(\tilde{X}^{S}_{GE})$ | $\phi^{Acc}(\tilde{X}^{US}_{GE})$ |
| --- | --- | --- | --- | --- | --- | --- |
| Prostate Tumor | 73.53 | 73.53 | 97.06 | 100 | 100 | 76.47 |
| Breast Cancer Relapse | 68.42 | 63.16 | 63.16 | 57.89 | 63.16 | 57.89 |
| Lung Cancer | 89.93 | 10.07 | 99.33 | 96.64 | 98.66 | 100 |
| Lymphoma | 58.82 | 61.76 | 97.06 | 76.47 | 97.06 | 67.65 |

Classification accuracies for testing cohorts of 4 different binary class gene-expression datasets, comparing (1) supervised random forest classification of original
feature space ($F$), (2) unsupervised hierarchical clustering of semi-supervised DR space ($X_{SSGE}$), and (3) unsupervised hierarchical clustering of consensus
embedding space ($\tilde{X}_{GE}$, $\tilde{X}_{PCA}$).



Conclusions 

We have presented a novel dimensionality reduction 
scheme called consensus embedding which can be used 
in conjunction with a variety of DR methods for a wide 
range of high-dimensional biomedical data classification 
and segmentation problems. Consensus embedding 
exploits the variance within multiple base embeddings 
and combines them to produce a single stable solution 
that is superior to any of the individual embeddings, 
from a classification perspective. Specifically, consensus 
embedding is able to preserve pairwise object-class rela- 
tionships from the high- to the low-dimensional space 
more accurately compared to any single embedding 
technique. An intelligent sub-sampling approach (via mean-shift) and code
parallelization ensure the computational feasibility and practicality of our
method.

Results of quantitative and qualitative evaluation in 
over 200 experiments on toy, synthetic, and clinical 
images in terms of detection and classification accuracy 
demonstrated that consensus embedding shows signifi- 
cant improvements compared to traditional DR methods 
such as PCA. We also compared consensus embedding 
to using the feature space directly, as well as to using an 
embedding based on distance preservation directly from 
the feature space (via MDS [21]), and found significant 
performance improvements when using consensus 
embedding. Even though the features and classifier used in these experiments
were not optimized for image segmentation purposes, consensus embedding
outperforms state-of-the-art segmentation schemes (graph embedding, also known
as normalized cuts [6]), differences being statistically significant in all
cases. Incorporating spatial constraints via algorithms such as Markov Random
Fields [54] could further bolster the image segmentation results obtained via
consensus embedding.

In experiments for high-dimensional biomedical data 
analysis using gene-expression signatures, consensus 
embedding also demonstrated improved results com- 
pared to semi-supervised DR methods (SSAGE [52]). 
Evaluating these results further illustrates properties of 
consensus embedding: (1) the consensus of multiple 
projections improves upon any single projection (via 
either linear PCA or non-linear GE), (2) the error rate 
for consensus embedding is not significantly affected by 
parameters associated with the method, as compared to 
traditional DR. Finally, the lower performance of a 
supervised classifier using the original noisy feature 
space as compared to using the consensus embedding 
representation demonstrates the utility of DR to obtain 
improved representations of the data for classification. 

It is however worth noting that in certain scenarios, 
consensus embedding may not yield optimal results. For 
instance, if very few embeddings are selected for consen- 
sus, the improvement in performance via consensus 
embedding over simple DR techniques may not be as 
significant. This translates to having a sparsely 








Figure 6 Visualization of 3D embeddings for gene-expression data (lung cancer). 3D visualization of embedding results for lung cancer
gene-expression data: (a) $X_{SSGE}$, (b) $\tilde{X}^{S}_{GE}$, (c) $\tilde{X}^{US}_{GE}$, (d) $\tilde{X}^{S}_{PCA}$, (e) $\tilde{X}^{US}_{PCA}$. The 3 axes correspond to the primary 3 eigenvalues obtained via
different DR methods (SSAGE, consensus-GE and consensus-PCA), while the colors of the objects (red and blue) are based on known class
information (cancer and non-cancer, respectively). Note the relatively poor performance of (a) semi-supervised DR compared to (b)-(e) consensus
DR. Both supervised ((b) and (d)) and unsupervised ((c) and (e)) consensus DR show relatively consistent separation between the classes with
distinct, tight clusters. The best clustering accuracy for this dataset was achieved by (c) unsupervised consensus GE ($\tilde{X}^{US}_{GE}$).
Table 6 Variation in classification accuracy as a function of parameters for consensus-PCA on gene-expression data

| Dataset | S, M = 200 | S, M = 500 | S, M = 1000 | US, M = 200 | US, M = 500 | US, M = 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| Prostate Tumor | 97.06 | 97.06 | 97.06 | 100 | 100 | 100 |
| Breast Cancer Relapse | 57.89 | 63.16 | 57.89 | 57.89 | 57.89 | 52.63 |
| Lung Cancer | 99.33 | 99.33 | 99.33 | 96.64 | 95.97 | 96.64 |
| Lymphoma | 94.12 | 97.06 | 97.06 | 76.47 | 67.65 | 61.76 |

Classification accuracies for testing cohorts of 4 different binary class gene-expression datasets for $\tilde{X}^{S}_{PCA}$ and $\tilde{X}^{US}_{PCA}$, while varying the number of subsets $M$
generated within CreateEmbed. Columns marked S and US report $\phi^{Acc}(\tilde{X}^{S}_{PCA})$ and $\phi^{Acc}(\tilde{X}^{US}_{PCA})$, respectively.



populated distribution for the estimation of the consensus pairwise
relationship, resulting in a lower confidence being associated with it. Such a
scenario may arise due to incorrectly specified selection criteria for
embeddings; however, it is relatively simple to implement self-tuning for the
selection parameter ($\theta$). Note we have reported results for a fixed value
of $\theta$ in all our experiments/applications, further demonstrating the
robustness of our methodology to choice of parameters.

In this work, consensus embedding has primarily been 
presented within a supervised framework (using class 
label information to evaluate embedding strength). Preli- 
minary results in developing an unsupervised evaluation 
measure using the R-squared cluster validity index [40] 
are extremely promising. However, additional tuning 
and testing of the measure is required to ensure 
robustness. 

Another area of future work is developing algorithms for generating
uncorrelated, independent embeddings. This is of great importance as
generating truly uncorrelated, independent embeddings will allow us to capture
the information from the data better, hence ensuring an improved consensus
embedding result.
As mentioned previously, methods to achieve this could 
include varying the parameter associated with the DR 
method (e.g. neighborhood parameter in LLE [5]) as 
well as the feature space perturbation method explored 
in this paper. These approaches are analogous to meth- 
ods of generating weak classifiers within a classifier 



ensemble [18], such as varying the k parameter in kNN
classifiers [39] or varying the training set for decision 
trees [37]. Note our feature space perturbation method 
to generate multiple, uncorrelated independent embed- 
dings is closely related to the method used in random 
forests [37] to generate multiple weak, uncorrelated clas- 
sifiers. Thus the embeddings we generate, as with the 
multiple classifiers generated in ensemble classifier 
schemes, are not intended to be independent in terms 
of information content, but rather in their method of 
construction. 

The overarching goal of consensus embedding is to optimally preserve pairwise
relationships when projecting from high- to low-dimensional space. In this
work, we quantified pairwise relationships using the popular Euclidean
distance metric. This was chosen as it is well understood in the context of
the methods used within our algorithm (e.g. the use of MDS). Alternative
pairwise relationship measures could include the geodesic distance [7] or the
symmetrized Kullback-Leibler divergence [55]. It is important to note that
such measures will need to satisfy the properties of a metric to ensure that
they correctly quantify both triangle as well as pairwise relationships. We
currently use MDS [21] to calculate the final consensus embedding (based on
the consensus pairwise distance matrix). We have chosen to use MDS for its
computational simplicity, but this method could be replaced by a non-linear
variant instead. Finally, our intelligent sub-sampling approach



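Table 7 Variation in classification accuracy as a function of parameters for consensus-GE on gene-expression data

| Dataset | S, M = 200 | S, M = 500 | S, M = 1000 | US, M = 200 | US, M = 500 | US, M = 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| Prostate Tumor | 100 | 100 | 97.06 | 76.47 | 76.47 | 76.47 |
| Breast Cancer Relapse | 57.89 | 57.89 | 57.89 | 57.89 | 57.89 | 57.89 |
| Lung Cancer | 98.66 | 98.66 | 97.99 | 100 | 100 | 90.60 |
| Lymphoma | 61.76 | 97.06 | 55.88 | 67.65 | 67.65 | 67.65 |

Classification accuracies for testing cohorts of 4 different binary class gene-expression datasets for $\tilde{X}^{S}_{GE}$ and $\tilde{X}^{US}_{GE}$, while varying the number of subsets $M$
generated within CreateEmbed. Columns marked S and US report $\phi^{Acc}(\tilde{X}^{S}_{GE})$ and $\phi^{Acc}(\tilde{X}^{US}_{GE})$, respectively.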
to ensure computational feasibility comes with a caveat of the out-of-sample
extension problem [56]. We currently handle this using a mapping of results
from high to low resolutions, but are identifying more sophisticated
solutions. We intend to study these areas in greater detail to further
validate the generalizability of consensus embedding.
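
The consensus step discussed in this section (a consensus pairwise distance matrix, later passed to MDS) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the helper names `pairwise_distances` and `consensus_distances` are hypothetical, the median is used as the estimator, Euclidean distance is used as in the text, and the final MDS step is not shown.

```python
from itertools import combinations
from math import dist
from statistics import median


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of objects in one embedding."""
    return {
        (i, j): dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }


def consensus_distances(embeddings, estimator=median):
    """Combine per-embedding distances pair by pair (hypothetical helper).

    `embeddings` is a list of K embeddings; each is a list of coordinate
    tuples, one per object, with objects in the same order in every embedding.
    """
    per_embedding = [pairwise_distances(e) for e in embeddings]
    return {
        pair: estimator([d[pair] for d in per_embedding])
        for pair in per_embedding[0]
    }


# Three toy 1D embeddings of the same 3 objects; the third misplaces object 1.
toy = [
    [(0.0,), (1.0,), (3.0,)],
    [(0.0,), (1.2,), (3.1,)],
    [(0.0,), (5.0,), (3.0,)],
]
print(consensus_distances(toy))
```

With the median as estimator, the single outlying embedding has limited influence on the consensus distance for each pair, which is the behaviour the ensemble argument relies on.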

Appendix A: Correspondence between Equation 1 
and Definition 2 

In order to calculate the probability $p(\Delta \mid c, d, e, \mathbb{R}^n)$, we
utilize the traditional formulation of a prior probability,

$$p = \frac{\text{total number of observed instances}}{\text{total number of instances}} \tag{3}$$

With reference to Equation 1, "instances" are triplets.
Therefore Equation 3 becomes,

$$p(\Delta) = \frac{\text{total number of preserved triplets (i.e. } \Delta = 1\text{)}}{\text{total number of possible triplets}} = \frac{\sum_{C} \Delta(c, d, e)}{Z} \tag{4}$$

Independent of the above, we intuitively arrived at a
mathematical formulation for embedding strength.

The strength of any embedding $\mathbb{R}^n$ will depend on
how well pairwise relationships are preserved from $\mathbb{R}^N$.

This in turn can be written in terms of the triplet relationship as well,

$$\psi^{ES}(\mathbb{R}^n) = \frac{\text{total number of preserved triplets}}{\text{total number of possible triplets}} = \frac{\sum_{C} \Delta(c, d, e)}{Z} \tag{5}$$
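
The ratio of preserved triplets to possible triplets can be computed by brute force on small data. The sketch below is illustrative only and rests on an assumption: a triplet is counted as preserved when the closest pair among its three objects in the high-dimensional space is also the closest pair in the embedding. The function names are hypothetical, and ties in distance are not handled specially.

```python
from itertools import combinations
from math import dist


def closest_pair(points, triplet):
    """Index pair with the smallest Euclidean distance within a triplet."""
    return min(
        combinations(triplet, 2),
        key=lambda pair: dist(points[pair[0]], points[pair[1]]),
    )


def embedding_strength(high, low):
    """Fraction of unique triplets whose closest pair is unchanged.

    `high` and `low` list the same objects, in the same order, in the
    original space and in the embedding respectively.
    """
    triplets = list(combinations(range(len(high)), 3))  # Z = S choose 3
    preserved = sum(
        closest_pair(high, t) == closest_pair(low, t) for t in triplets
    )
    return preserved / len(triplets)


high = [(0, 0, 0), (1, 0, 0), (5, 0, 0), (5, 3, 0)]
low = [(0.0,), (1.0,), (5.0,), (2.2,)]  # a lossy 1D embedding of `high`
print(embedding_strength(high, low))
```

The enumeration costs on the order of $S^3$ distance comparisons, so it is only practical for small object sets.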

Appendix B: Properties of consensus embedding 

The following proposition will demonstrate that $\tilde{\mathbb{R}}^n$ will
have a lower inherent error in its pairwise relationships compared to the
strong embeddings $\mathbb{R}^n_k$, $k \in \{1, \ldots, K\}$, used in its
construction. Note that relevant notation and definitions have been carried
over from Section 2.

We first define the mean squared error (MSE) in the pairwise relationship
between every pair of objects $c, d \in C$ in any embedding $\mathbb{R}^n$ with
respect to the true pairwise relationships in $\hat{\mathbb{R}}^n$ as,

$$\varepsilon_{X} = E_{cd}\left[\left(\hat{\delta}^{cd} - \delta^{cd}\right)^2\right] \tag{6}$$

where $E_{cd}$ is the expectation of the squared error in the pairwise
relationships in $\mathbb{R}^n$ calculated over all pairs of objects
$c, d \in C$. We can hence calculate the expected MSE over all $K$ base
embeddings specified above as,

$$\varepsilon_{K,X} = E_{K}\left[\varepsilon_{X}\right] = E_{K}\left[E_{cd}\left[\left(\hat{\delta}^{cd} - \delta_k^{cd}\right)^2\right]\right] \tag{7}$$



Given $K$ observations $\delta_k^{cd}$, $k \in \{1, \ldots, K\}$ (derived from
selected base embeddings $\mathbb{R}^n_k$), we define the pairwise relationship
in the consensus embedding $\tilde{\mathbb{R}}^n$ as
$\tilde{\delta}^{cd} = E_K(\delta_k^{cd})$, where $E_K$ is the expectation of
$\delta_k^{cd}$ over $K$ observations. The MSE in $\tilde{\delta}^{cd}$ with
respect to the true pairwise relationships in $\hat{\mathbb{R}}^n$ may be
defined as (similar to Equation 6),

$$\varepsilon_{\tilde{X}} = E_{cd}\left[\left(\hat{\delta}^{cd} - \tilde{\delta}^{cd}\right)^2\right] \tag{8}$$

where $E_{cd}$ is the expectation of the squared error in the pairwise
relationships in $\tilde{\mathbb{R}}^n$ calculated over all pairs of objects
$c, d \in C$. It is clear that if for all $c, d \in C$ that
$\tilde{\delta}^{cd} = \hat{\delta}^{cd}$, then $\tilde{\mathbb{R}}^n$ is also a
true embedding.

Proposition 2 Given $K$ independent, strong embeddings, $\mathbb{R}^n_k$,
$k \in \{1, \ldots, K\}$, which are used to construct $\tilde{\mathbb{R}}^n$,
$\varepsilon_{K,X} \geq \varepsilon_{\tilde{X}}$.

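Proof.

Expanding Equation 7,

$$\varepsilon_{K,X} = E_{cd}\left[\left(\hat{\delta}^{cd}\right)^2\right] - 2E_{cd}\left[\hat{\delta}^{cd}\, E_K\left(\delta_k^{cd}\right)\right] + E_{cd}\left[E_K\left[\left(\delta_k^{cd}\right)^2\right]\right]$$

Now, $E_K\left[\left(\delta_k^{cd}\right)^2\right] \geq \left(E_K\left(\delta_k^{cd}\right)\right)^2 = \left(\tilde{\delta}^{cd}\right)^2$, so

$$\varepsilon_{K,X} \geq E_{cd}\left[\left(\hat{\delta}^{cd}\right)^2\right] - 2E_{cd}\left[\hat{\delta}^{cd}\, \tilde{\delta}^{cd}\right] + E_{cd}\left[\left(\tilde{\delta}^{cd}\right)^2\right] = \varepsilon_{\tilde{X}}$$

□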



Proposition 2 implies that $\tilde{\mathbb{R}}^n$ will never have a higher
error than the maximum error associated with any individual strong embedding
$\mathbb{R}^n_k$, $k \in \{1, \ldots, K\}$, involved in its construction.
However if $\varepsilon_{K,X}$ is low, $\varepsilon_{\tilde{X}}$ may not
significantly improve on it. Similar to Bagging [17] where correlated errors
across weak classifiers are preserved in the ensemble result, if the pairwise
relationship $\delta_k^{cd}$ is incorrect across all $K$ embeddings,
$\tilde{\delta}^{cd}$ will be incorrect as well. However Proposition 2
guarantees that $\varepsilon_{\tilde{X}}$ will never be worse than
$\varepsilon_{K,X}$.
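
The inequality in Proposition 2 can be checked numerically on toy values. The sketch below is only a sanity check under stated assumptions: the "true" distances and the three base embeddings are made-up numbers, the helper names are hypothetical, and the consensus is taken as the mean over the $K$ embeddings, as in this appendix.

```python
from statistics import fmean


def mse(true, est):
    """Mean squared error between true and estimated pairwise distances."""
    return fmean([(t - e) ** 2 for t, e in zip(true, est)])


def expected_base_mse(true, bases):
    """Average of the per-embedding MSE over all K base embeddings."""
    return fmean([mse(true, b) for b in bases])


def consensus_mse(true, bases):
    """MSE of the consensus distances (mean over the K embeddings)."""
    consensus = [fmean(values) for values in zip(*bases)]
    return mse(true, consensus)


# Made-up distances for 3 object pairs: the true values and K = 3 noisy estimates.
true = [1.0, 2.0, 3.0]
bases = [
    [1.5, 2.0, 2.0],
    [0.5, 3.0, 3.0],
    [1.0, 1.0, 4.0],
]
print(expected_base_mse(true, bases), consensus_mse(true, bases))
```

In this toy case the errors of the individual estimates cancel when averaged, so the consensus error falls well below the average base error; when errors are correlated across embeddings the gap shrinks, as noted above.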

Appendix C: Practical implementation of 
embedding strength 

While embedding strength may be seen as a generalized 
concept for evaluating embeddings, in this work we 
have examined applications of DR and consensus 
embedding to classifying biomedical data (Section 3). 
We now derive a direct relationship between embedding 
strength and classification accuracy, presented in Theo- 
rem 1 below. 

For the purposes of the following discussion, all objects $c, d, e \in C$ are
considered to be associated with class labels
$l(c), l(d), l(e) \in \{\omega_1, \omega_2\}$, respectively, such that if
$l(c) = l(d) = \omega_1$ and $l(e) = \omega_2$ then
$\Lambda^{cd} < \Lambda^{ce}$ and $\Lambda^{cd} < \Lambda^{de}$. Note that
$\omega_1, \omega_2$ are binary class labels that can be assigned to all
objects $c \in C$.
Definition 6 A unique triplet $(c, d, e) \in C$ with
$l(c), l(d), l(e) \in \{\omega_1, \omega_2\}$ is called a class triplet,
$(c, d, e)_l$, if either $l(c) \neq l(d)$, or $l(d) \neq l(e)$, or
$l(c) \neq l(e)$.

Thus, in a class triplet of objects, two out of three objects have the same
class label but the third has a different class label, e.g. if
$l(c) = l(d) = \omega_1$ and $l(e) = \omega_2$. Further, in the specific case
that $\Delta(c, d, e) = 1$ for a class triplet $(c, d, e)_l$, it will be
denoted as $\Delta^l(c, d, e)$. For the above example of a class triplet, we
know that $\Lambda^{cd} < \Lambda^{ce}$ and $\Lambda^{cd} < \Lambda^{de}$ (see
above). If $\Delta(c, d, e) = 1$, $\delta^{cd} < \delta^{ce}$ and
$\delta^{cd} < \delta^{de}$. This implies that even after projection from
$\mathbb{R}^N$ to $\mathbb{R}^n$, the class-based pairwise relationships within
the data are accurately preserved (a classifier can be constructed which will
correctly classify objects $c, d, e$ in $\mathbb{R}^n$).

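Consider that if $\frac{S}{R}$ objects have class label $\omega_1$, then
$\frac{(R-1)S}{R}$ objects will have class label $\omega_2$. Based on the total
number of unique triplets $Z$, the total number of triplets which are not
class triplets is,

$$Y = \frac{\left(\frac{S}{R}\right)!}{3!\left(\frac{S}{R} - 3\right)!} + \frac{\left(\frac{(R-1)S}{R}\right)!}{3!\left(\frac{(R-1)S}{R} - 3\right)!} \tag{9}$$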



$Y$ will be a constant for a given set of objects $C$, and is based on forming
unique triplets $(c, d, e)$ where $l(c) = l(d) = l(e)$ (triplets which are not
class triplets). $U = (Z - Y)$ will correspond to the number of class triplets
that may be formed for set $C$. If all $U$ class triplets have
$\Delta^l(c, d, e) = 1$, then it is possible to construct $U$ classifiers which
correctly classify the corresponding objects in these class triplets.
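
The counts $Z$, $Y$ and $U$ can be cross-checked on a small label vector. The sketch below is illustrative: the function names are hypothetical, each binomial term is evaluated with `math.comb` (which returns 0 when a class has fewer than 3 objects), and a brute-force enumeration is included only to confirm the closed form.

```python
from collections import Counter
from itertools import combinations
from math import comb


def triplet_counts(labels):
    """Return (Z, Y, U) for a list of class labels.

    Z: all unique triplets; Y: triplets whose three labels agree;
    U = Z - Y: class triplets.
    """
    Z = comb(len(labels), 3)
    Y = sum(comb(n, 3) for n in Counter(labels).values())
    return Z, Y, Z - Y


def count_class_triplets(labels):
    """Brute-force count of triplets containing more than one label."""
    return sum(
        len({labels[i] for i in triplet}) > 1
        for triplet in combinations(range(len(labels)), 3)
    )


labels = ["w1", "w1", "w1", "w1", "w2", "w2"]  # S = 6 objects, 4 in one class
print(triplet_counts(labels), count_class_triplets(labels))
```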

Definition 7 Given $U$ unique class triplets $(c, d, e)_l \in C$ and an
embedding $\mathbb{R}^n$ of all objects $c, d, e \in C$, the associated
classification accuracy $\phi^{Acc}(\mathbb{R}^n) = \frac{\sum_{C} \Delta^l(c, d, e)}{U}$.

As illustrated previously, class triplets $(c, d, e)_l$ for which
$\Delta^l(c, d, e) = 1$ will correspond to those objects which will be
classified correctly in $\mathbb{R}^n$. Therefore, the classification accuracy
$\phi^{Acc}(\mathbb{R}^n)$ may simply be defined as the fraction of class
triplets $(c, d, e)_l \in C$ for which $\Delta^l(c, d, e) = 1$.
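
The triplet-based accuracy described here can be evaluated directly for a labelled toy embedding. The sketch below is a hedged illustration rather than the paper's implementation: it assumes binary labels with both classes present, treats a class triplet as preserved when its two same-class objects form the closest pair in the embedding, and uses a hypothetical function name.

```python
from itertools import combinations
from math import dist


def class_triplet_accuracy(points, labels):
    """Fraction of class triplets whose same-class pair is the closest pair.

    Assumes binary labels with both classes present, so that at least one
    class triplet exists.
    """
    preserved = total = 0
    for triplet in combinations(range(len(points)), 3):
        pairs = list(combinations(triplet, 2))
        same = [p for p in pairs if labels[p[0]] == labels[p[1]]]
        if len(same) != 1:  # all three labels agree: not a class triplet
            continue
        total += 1
        closest = min(pairs, key=lambda p: dist(points[p[0]], points[p[1]]))
        preserved += closest == same[0]
    return preserved / total


labels = ["w1", "w1", "w2", "w2"]
well_separated = [(0.0,), (1.0,), (10.0,), (11.0,)]
overlapping = [(0.0,), (1.0,), (3.0,), (11.0,)]
print(class_triplet_accuracy(well_separated, labels))
print(class_triplet_accuracy(overlapping, labels))
```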

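Theorem 1 For any $\mathbb{R}^n$, the corresponding $\psi^{ES}(\mathbb{R}^n)$
increases monotonically as a function of $\phi^{Acc}(\mathbb{R}^n)$.

Proof.

By definition, $\sum_{C} \Delta(c, d, e) \geq \sum_{C} \Delta^l(c, d, e)$.

Dividing by $Z = U + Y$ on either side,
$\psi^{ES}(\mathbb{R}^n) \geq \frac{\sum_{C} \Delta^l(c, d, e)}{U + Y}$.

Inverting, and using $\phi^{Acc}(\mathbb{R}^n) = \frac{\sum_{C} \Delta^l(c, d, e)}{U}$ from Definition 7,

$$\frac{1}{\psi^{ES}(\mathbb{R}^n)} \leq \frac{U + Y}{\sum_{C} \Delta^l(c, d, e)} = \frac{1}{\phi^{Acc}(\mathbb{R}^n)} + \frac{Y}{\sum_{C} \Delta^l(c, d, e)}$$

As $\frac{Y}{\sum_{C} \Delta^l(c, d, e)}$ is a constant, $\psi^{ES}(\mathbb{R}^n)$
increases monotonically with $\phi^{Acc}(\mathbb{R}^n)$. □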



Thus an embedding $\mathbb{R}^n$ with a high embedding strength will have a
high classification accuracy. Practically, this implies that
$\psi^{ES}(\mathbb{R}^n)$ may be estimated via any measure of object-class
discrimination such as classification accuracy or cluster-validity measures.
We have exploited this relationship in our algorithmic implementation
(Section 2).

Endnotes 

1 http://www.bic.mni.mcgill.ca/brainweb/

2 These datasets were downloaded from the Biomedical Kent-Ridge Repositories
at http://datam.i2r.a-star.edu.sg/datasets/krbd/



Acknowledgements 

The authors would like to acknowledge Drs. Mark Rosen, John
Tomaszewski, and Michael Feldman from the Hospital of the University of
Pennsylvania for the use of ex vivo prostate MRI and histology data. They
would also like to thank Andrew Janowczyk, Dr. Jonathan Chappelow, and Dr.
James Monaco for results, discussions, and implementations used in this
paper. This work was made possible via grants from the Department of
Defense Prostate Cancer Research Program (W81XWH-08-1-0072), Wallace H.
Coulter Foundation, National Cancer Institute (Grant Nos. R01CA136535-01,
R01CA140772-01, and R03CA143991-01), The Cancer Institute of New Jersey,
and the Society for Imaging Informatics in Medicine.

Authors' contributions 

AM and SV co-conceived the core algorithm and theoretical justifications of 
consensus embedding. SV further developed, evaluated, and refined the 
implementation and experiments. AM directed the research and the 
development of the manuscript. Both authors contributed to writing and 
editing, and have read and approved the final manuscript. 

Received: 23 August 2011 Accepted: 8 February 2012 
Published: 8 February 2012 

References 

1. Bellman R: Adaptive control processes: a guided tour. Princeton University
Press; 1961.

2. Jolliffe I: Principal Component Analysis. Springer; 2002.

3. Lin T, Zha H: Riemannian Manifold Learning. IEEE Transactions on Pattern
Analysis and Machine Intelligence 2008, 30(5):796-809.

4. Lee G, Rodriguez C, Madabhushi A: Investigating the Efficacy of Nonlinear
Dimensionality Reduction Schemes in Classifying Gene- and Protein-Expression
Studies. IEEE/ACM Transactions on Computational Biology and
Bioinformatics 2008, 5(3):1-17.

5. Saul L, Roweis S: Think globally, fit locally: unsupervised learning of low
dimensional manifolds. Journal of Machine Learning Research 2003,
4:119-155.

6. Shi J, Malik J: Normalized Cuts and Image Segmentation. IEEE Transactions
on Pattern Analysis and Machine Intelligence 2000, 22(8):888-905.

7. Tenenbaum J, Silva V, Langford J: A Global Geometric Framework for
Nonlinear Dimensionality Reduction. Science 2000, 290(5500):2319-2323.

8. Dawson K, Rodriguez R, Malyj W: Sample phenotype clusters in high-density
oligonucleotide microarray data sets are revealed using Isomap,
a nonlinear algorithm. BMC Bioinformatics 2005, 6(1):195.

9. Madabhushi A, Shi J, Rosen M, Tomaszeweski JE, Feldman MD: Graph
embedding to improve supervised classification and novel class
detection: application to prostate cancer. Proc 8th Int'l Conf Medical Image
Computing and Computer-Assisted Intervention (MICCAI) 2005, 729-37.

10. Varini C, Degenhard A, Nattkemper T: Visual exploratory analysis of DCE-MRI
data in breast cancer by dimensional data reduction: A comparative
study. Biomedical Signal Processing and Control 2006, 1(1):56-63.

11. Quinlan J: The effect of noise on concept learning. Morgan Kaufmann; 1986.






12. Balasubramanian M, Schwartz EL, Tenenbaum JB, de Silva V, Langford JC:
The Isomap Algorithm and Topological Stability. Science 2002,
295(5552):7a.

13. Chang H, Yeung D: Robust locally linear embedding. Pattern Recognition
2006, 39(6):1053-1065.

14. Shao C, Huang H, Zhao L: P-ISOMAP: A New ISOMAP-Based Data
Visualization Algorithm with Less Sensitivity to the Neighborhood Size.
Dianzi Xuebao (Acta Electronica Sinica) 2006, 34(8):1497-1501.

15. Fred A, Jain A: Combining Multiple Clusterings Using Evidence
Accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence
2005, 27(6):835-850.

16. Freund Y, Schapire R: A decision-theoretic generalization of on-line
learning and an application to boosting. Proc 2nd European Conf
Computational Learning Theory 1995, 23-37.

17. Breiman L: Bagging predictors. Machine Learning 1996, 24(2):123-140.

18. Dietterich T: Ensemble Methods in Machine Learning. Proc 1st Int'l
Workshop on Multiple Classifier Systems 2000, 1-15.

19. MacQueen J: Some Methods for Classification and Analysis of
Multivariate Observations. Proc Fifth Berkeley Symposium on Mathematical
Statistics and Probability 1967, 281-297.

20. Fern X, Brodley C: Random Projection for High Dimensional Data
Clustering: A Cluster Ensemble Approach. Proc 20th Int'l Conf Machine
Learning 2003, 186-193.

21. Venna J, Kaski S: Local multidimensional scaling. Neural Networks 2006,
19(6):889-899.

22. Samko O, Marshall A, Rosin P: Selection of the optimal parameter value
for the Isomap algorithm. Pattern Recognition Letters 2006, 27(9):968-979.

23. Kouropteva O, Okun O, Pietikainen M: Selection of the Optimal Parameter
Value for the Locally Linear Embedding Algorithm. Proc 1st Int'l Conf
Fuzzy Systems and Knowledge Discovery 2002, 359-363.

24. de Silva V, Tenenbaum J: Global Versus Local Methods in Nonlinear
Dimensionality Reduction. Proc 15th Conf Adv Neural Information Processing
Systems (NIPS) 2003, 705-712.

25. Zelnik-Manor L, Perona P: Self-tuning spectral clustering. Proc 17th Conf
Adv Neural Information Processing Systems (NIPS) 2004, 1601-1608.

26. Geng X, Zhan DC, Zhou ZH: Supervised nonlinear dimensionality
reduction for visualization and classification. IEEE Transactions on Systems,
Man, and Cybernetics: Part B, Cybernetics 2005, 35(6):1098-107.

27. de Ridder D, Kouropteva O, Okun O, Pietikainen M, Duin R: Supervised
Locally Linear Embedding. Proc Artificial Neural Networks and Neural
Information Processing 2003, 333-341.

28. Athitsos V, Alon J, Sclaroff S, Kollios G: BoostMap: An Embedding Method
for Efficient Nearest Neighbor Retrieval. IEEE Transactions on Pattern
Analysis and Machine Intelligence 2008, 30(1):89-104.

29. Lawrence N: Spectral Dimensionality Reduction via Maximum Entropy.
Proc 14th Intn'l Conf Artificial Intelligence and Statistics (AISTATS) 2011, 51-59.

30. Mao K, Liang F, Mukherjee S: Supervised Dimension Reduction Using
Bayesian Mixture Modeling. Proc 13th Intn'l Conf Artificial Intelligence and
Statistics (AISTATS) 2010, 501-508.

31. Blum A, Mitchell T: Combining labeled and unlabeled data with co-training.
Proc 11th Annual Conf Computational Learning Theory 1998,
92-100.

32. Hou C, Zhang C, Wu Y, Nie F: Multiple view semi-supervised
dimensionality reduction. Pattern Recognition 2009, 43(3):720-730.

33. Wachinger C, Yigitsoy M, Navab N: Manifold Learning for Image-Based
Breathing Gating with Application to 4D Ultrasound. Proc 13th Int'l Conf
Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2010,
26-33.

34. Wachinger C, Navab N: Manifold Learning for Multi-Modal Image
Registration. Proc 11th British Machine Vision Conference (BMVC) 2010,
82.1-82.12.

35. Cover T, Hart P: Nearest neighbor pattern classification. IEEE Transactions
on Information Theory 1967, 13(1):21-27.

36. Tiwari P, Rosen M, Madabhushi A: Consensus-locally linear embedding (C-LLE):
application to prostate cancer detection on magnetic resonance
spectroscopy. Proc 11th Int'l Conf Medical Image Computing and Computer-
Assisted Intervention (MICCAI) 2008, 330-8.

37. Ho TK: The random subspace method for constructing decision forests.
IEEE Transactions on Pattern Analysis and Machine Intelligence 1998,
20(8):832-844.



38. Levina E, Bickel P: Maximum Likelihood Estimation of Intrinsic Dimension.
Proc 17th Conf Adv Neural Information Processing Systems (NIPS) 2005,
777-784.

39. Kuncheva L: Combining pattern classifiers: methods and algorithms.
Wiley-Interscience; 2004.

40. Halkidi M, Batistakis Y, Vazirgiannis M: On Clustering Validation
Techniques. Journal of Intelligent Information Systems 2001, 17(2):107-145.

41. Patel JK, Read CB: Handbook of the normal distribution. Marcel Dekker; 1996.

42. Yang C, Duraiswami R, Gumerov NA, Davis L: Improved fast gauss
transform and efficient kernel density estimation. Proc 9th IEEE Int'l Conf
Computer Vision (ICCV) 2003, 664-671.

43. Comaniciu D, Meer P: Mean shift: a robust approach toward feature
space analysis. IEEE Transactions on Pattern Analysis and Machine
Intelligence 2002, 24(5):603-619.

44. Haralick RM, Shanmugam K, Dinstein I: Textural Features for Image
Classification. IEEE Transactions on Systems, Man and Cybernetics 1973,
3(6):610-621.

45. Madabhushi A, Feldman M, Metaxas D, Tomaszeweski J, Chute D:
Automated Detection of Prostatic Adenocarcinoma from High-Resolution
Ex Vivo MRI. IEEE Transactions on Medical Imaging 2005,
24(12):1611-1625.

46. Herlidou-Meme S, Constans JM, Carsin B, Olivie D, Eliat PA, Nadal-
Desbarats L, Gondry C, Le Rumeur E, Idy-Peretti I, de Certaines JD: MRI
texture analysis on texture test objects, normal brain and intracranial
tumors. Magnetic Resonance Imaging 2003, 21(9):989-993.

47. Dai JJ, Lieu L, Rocke D: Dimension reduction for classification with gene
expression microarray data. Statistical Applications in Genetics and
Molecular Biology 2006, 5(1):Article6.

48. Carballido-Gamio J, Belongie S, Majumdar S: Normalized cuts in 3-D for
spinal MRI segmentation. IEEE Transactions on Medical Imaging 2004,
23(1):36-44.

49. Kwan R, Evans A, Pike G: MRI simulation-based evaluation of image-
processing and classification methods. IEEE Transactions on Medical
Imaging 1999, 18(11):1085-1097.

50. Chappelow J, Bloch BN, Rofsky N, Genega E, Lenkinski R, DeWolf W,
Madabhushi A: Elastic registration of multimodal prostate MRI and
histology via multiattribute combined mutual information. Medical
Physics 2011, 38(4):2005-2018.

51. Liu H, Li J, Wong L: A comparative study on feature selection and
classification methods using gene expression profiles and proteomic
patterns. Genome Informatics 2002, 13:51-60.

52. Zhao H: Combining labeled and unlabeled data with graph embedding.
Neurocomputing 2006, 69(16):2385-2389.

53. Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of
genome-wide expression patterns. Proceedings of the National Academy of
Sciences of the United States of America 1998, 95(25):14863-14868.

54. Monaco JP, Tomaszewski JE, Feldman MD, Hagemann I, Moradi M,
Mousavi P, Boag A, Davidson C, Abolmaesumi P, Madabhushi A: High-
throughput detection of prostate cancer in histological sections using
probabilistic pairwise Markov models. Medical Image Analysis 2010,
14(4):617-629.

55. Moakher M, Batchelor PG: Symmetric Positive-Definite Matrices: From
Geometry to Applications and Visualization 2006.

56. Bengio Y, Paiement J, Vincent P, Delalleau O, Le Roux N, Ouimet M: Out-of-
Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral
Clustering. Proc 16th Conf Adv Neural Information Processing Systems (NIPS)
2004, 177-184.



doi:10.1186/1471-2105-13-26

Cite this article as: Viswanath and Madabhushi: Consensus embedding: 
theory, algorithms and application to segmentation and classification of 
biomedical data. BMC Bioinformatics 2012 13:26. 



